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Abstract 


This report is concerned with the Mondrian process [l] and its applications 
in machine learning. The Mondrian process is a guillotine-partition-valued 
stochastic process that possesses an elegant self-consistency property. The 
first part of the report uses simple concepts from applied probability to 
define the Mondrian process and explore its properties. 

The Mondrian process has been used as the main building block of a clever 
online random forest classification algorithm that turns out to be equiv¬ 
alent to its batch counterpart. [2] We outline a slight adaptation of this 
algorithm to regression, as the remainder of the report uses regression as 
a case study of how Mondrian processes can be utilized in machine learn¬ 
ing. In particular, the Mondrian process will be used to construct a fast 
approximation to the computationally expensive kernel ridge regression 
problem with a Laplace kernel. 

The complexity of random guillotine partitions generated by a Mondrian 
process and hence the complexity of the resulting regression models is 
controlled by a lifetime hyperparameter. It turns out that these models 
can be efficiently trained and evaluated for all lifetimes in a given range 
at once, without needing to retrain them from scratch for each lifetime 
value. This leads to an efficient procedure for determining the right model 
complexity for a dataset at hand. 

The limitation of having a single lifetime hyperparameter will motivate 
the final Mondrian grid model, in which each input dimension is endowed 
with its own lifetime parameter. In this model we preserve the property 
that its hyperparameters can be tweaked without needing to retrain the 
modified model from scratch. 
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Preliminaries 


0.1 Notation 

Capital letters are used for counts: N stands for a number of data points and D for the dimensionality 
of the input space. When we consider Mondrian forests, M will denote the number of Mondrian trees in 
the forest. In later chapters we will compute a randomized feature space and we will use C to denote the 
number of its dimensions (number of features). Whenever possible, we will use matching lowercase letters 
as indices over corresponding ranges, i.e., we will use n to index datapoints, d to index input dimensions, 
Til to index Mondrian trees and c to index random feature space dimensions. 

Matrices and vectors are typeset in boldface (e.g., A, 9 ) with the only exception of feature vectors. 
Throughout this report X £ M. NxD stands for a data matrix (also called the design matrix) whose n-th 
row £ R 13 is the feature vector of the n-th data point (in input space). The (n, d)-entry x n d of this 
matrix is the value of feature d for datapoint n. Once we use a function z to map input data points into 
a randomized feature space, we will have a feature matrix Z £ R JVxC whose n-th row £ R c is the 
feature vector of the n-th data point in the new C-dimensional feature space. 

By e.; we will denote the *-th standard basis vector, i.e. a binary vector with a single 1 entry in 
position i. The dimensionality of this vector will be clear from context. The identity matrix of dimension 
k x k will be written I*,. 

The indicator function I(P), takes value 1 when the predicate P is true and the value 0 otherwise. 


0.2 Mathematical preliminaries 

In this section we recall basic mathematical concepts from applied probability that we would like to use 
throughout the report without repeated elaboration. 


Probability distributions 

Definition 0.1. The exponential distribution with rate A > 0, written Exp(A), is the continuous 
probability distribution on R with probability density function p(x |A) = Ae^ Aa: I(a: > 0). 


Recall that the rate A of an exponential random variable is inversely proportional to its mean j (see 
Proposition A.l), meaning that variables with large rate will tend to take smaller values, and vice versa. 


Definition 0.2. For D > 1, the D-dimensional (non-degenerate) Gaussian (or Normal) distribution 
with mean p £ R 33 and (positive definite) covariance X £ R 33x33 has density 


■A/’Mp.e) 


1 

(27T)f |S|3 


exp 


(-^( x -m) t £ : (x- 



where |S| is the determinant of S. In the case D = 1 we have S = |X| = a 2 > 0 and we call this the 
variance; its inverse p := a~ 2 is then called the precision. 


Lack of memory and competing exponential clocks 

The exponential distribution plays a major role in the construction of the Mondrian process. This is 
because of its lack of memory property and the related concept of competing exponential clocks, which 
lead to elegant properties of the Mondrian process. 

Lemma 0.3 (Simple lack of memory property). Let Z ~ Exp(A). Then 

\/t> 0 ((Z -t)\ (Z > t)) ~ Exp(A) 

In words, the residual lifetime Z — t given survival {Z > t} is again Exp(A) distributed. 

Proof. The more general Lemma |0.4| below is proved as Lemma |A.3| in the appendix. □ 
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It is interesting to note that the exponential distribution is the unique probability distribution sup¬ 
ported on the positive reals with this property (see Proposition A.2 in the appendix). It turns out that 
the lack of memory property of the exponential distribution also holds at a random time, provided that 
it is independent of the exponential random variable considered. 


Lemma 0.4 (Lack of memory property). Let Z be an exponential random variable and T an independent 
nonnegative random variable. Then Z has the lack of memory property at the random time T, i.e. 


Vm > 0 P(Z -T > u\Z >T) = P(Z>u) 


Proof. Proof appears as Lemma A.3 in the appendix. 


D 


Definition 0.5. A set of N competing exponential clocks is a collection of N independent exponential 
random variables Z\..... Zjq with respective rates Ai, ..., Ajv >0. 


We can think of these N random variables as clocks, all started at the same time 0, and each having 
an independent and exponentially distributed residual time until ringing. Natural questions to asks are: 
when will the first of the N clocks ring and which clock is it going to be? Once the first clock rings, what 
is the joint distribution of residual times of the remaining N — 1 clocks? The lack of memory property of 
the exponential distribution leads to simple answers, entailed in the following theorem. 

Theorem 0.6 (Competing exponential clocks). Say Zi ,..., Z jv are N competing exponential clocks with 
respective rates Ai,..., Ajv- Then 

• the time until any of the clocks rings has Exp(]T) n A„) distribution, 

• the probability that the n-th clock is the first one to ring is , 

• conditionally given the time and identity of the first clock to ring, the remaining N—l clocks remain 
independent and each has preserved its original Exp(A n ) residual time distribution. 


Proof. Partial proofs are given in Appendix A. 


□ 


Statistical parameter estimation 

Say we have a probabilistic model parametrized by 6. The likelihood C(6\T>) is the probability p(V\0) 
of observed data T> under this model as a function of the parameters 6. The maximum likelihood estimate 
(MLE) of the parameters is 

0 MLE := argmaxp(2?|0) 
e 

Suppose that before observing any data, we also have a prior belief about the value of the parameters 6, 
encoded as a prior probability distribution p(0). Then the maximum a posteriori (MAP) estimate of 6 
is the set of parameters that maximizes the posterior distribution p(6\V): 

G uap := argmaxp(<?|2?) = argmax —— . ^ - —- = argmaxp(2?|0)p(<?) 
e o P\T>) e 

We say that a prior is conjugate for a likelihood, if the resulting posterior is a probability distribution 
from the same parametric family as the prior. 

Example 0.7. Suppose we want to model data y GM. with a Gaussian likelihood p{y\p) = Af(y\p, cr;( oise ) 
where cr)( oise is fixed and the mean p is unknown. Say we place a prior distribution p{p) = Af(p\p pi i OI , <Tp rior ) 
on /i to express our belief that p is not far from p pr j or . Then the prior is conjugate to the likelihood and 
the posterior distribution is again Gaussian. More concretely, if we gather N independent observations 
V = {yi,... ,y n } then the posterior distribution of p is 

, i„, .r ( | Ppriorltprior + Pnoise X/n=l V n I , AT \ — 1 | 

p{p\V) = AI \p | -——-— , (Pprior + N Pnoise ) 

\ Pprior i -^Pnoise / 

where p pr ior = fp^or anc ^ Pnoise = cr“ oise are the prior and noise precisions (inverse variances), respectively. 


Proof. Appears as Proposition |A.6|in the appendix. 


□ 









Mondrian process 


In this chapter we define the Mondrian process [I] as a temporal stochastic process taking values in 
guillotine partitions of an axis-aligned box and then attempt to give intuitive explanations for some of 
its elegant properties. Let us start by agreeing on terminology. 

Definition 1.1. A temporal stochastic process taking values in a space S is a collection (M t )t> o of 
5-valued random variables, indexed by a parameter t £ [0, oo) that we think of as time. 

Definition 1.2. An (axis-aligned) box 0 in R- 0 is a set of the form 0 = 0i X"'X0flC R D , where 
each 0rf is a bounded interval [a,d,bd]- We only work with axis-aligned boxes in this report, so we drop 
the ’’axis-aligned” qualification henceforth. 

Definition 1.3. The linear dimension of a box 0 = [ai,6i] x • • • x [an, bo] in R 15 , written LD(0), is 
the sum of its D dimensions, i.e., LD(0) := J2d=i(^d ~ o,d)- 

Definition 1.4. Given a box 0 in R 15 , a guillotine partition of 0 is a hierarchical partition of 0 
obtained by recursively splitting boxes of the partition by some hyperplane orthogonal to one of the D 
coordinate axes. 

Guillotine partitions can be thought of as k-d trees, where each node corresponds to a box in R- 0 and 
each non-leaf node n has exactly two children, corresponding to the two boxes obtained after cutting the 
box associated with n by a hyperplane that is orthogonal to one of the D coordinate axes. 

Mondrian process definition 

In this subsection we define the Mondrian process over a box 0 = [ai, &i] x • • • x [ao, bo] as a temporal 
stochastic process taking values in guillotine partitions of 0. An intuitive way of thinking about the 
process is that it starts out at time t = 0 with the trivial partition of 0 (containing no cuts) and as time 
progresses, new cuts start to randomly appear, hierarchically splitting 0 into more refined partitions. 
The precise distribution that governs how the cuts appear is given by this recursive generative process: 


1 

procedure Mondrian(0) 


2 

return Mondrian-Started-At(0, 0) 


1 

procedure Mondrian-Started-At( 0, t 0 ) 

> 0 = [ai,bi\ x • • • x [a D ,b D \ 

2 

T ~ Exp(LD(0)) 

> time until first cut appears 

3 

d ~ Discrete(pi,... ,pd) where pa oc (bd — dd) 

> dimension of that cut 

4 

x ~U{{a d ,bd ]) 

> location of that cut 

5 

M < 4 - Mondrian-Started-At( 0<,t 0 + T ) where 0< = {zg0 

\ Z d <x} 

6 

Af> 4 - Mondrian-Started-At( 0>,t 0 + T) where 0> = {z e 0 

\ Zd>x} 

7 

return (t 0 , to + T, d, x, M < , A/ > ) 



The recursive procedure Mondrian-Started-At(0, to) generates a Mondrian process on the box 0, 
started at time to- Let us analyze this procedure line by line: 

• Line 2 generates the time it takes for the first cut in 0 to appear. The distribution is exponential 
with rate the linear dimension of 0. Note that by Proposition ! A. 1[ in larger boxes a cut is expected 
sooner than in smaller ones. The absolute time to + T of the generated cut is called its birth time. 

• Lines 3 and 4 generate the dimension d and location x of the first cut, respectively. The former 
is generated proportionally to the dimensions of 0 and the latter is then chosen uniformly. The 
cutting hyperplane is orthogonal to the d-th coordinate axis and crosses it at the point x. Thus the 
cutting hyperplane ’’lives” in the d-th dimension. 

• Lines 5 and 6 recursively generate independent Mondrians M < , M > on the two boxes 0 < , 0 > 
obtained by cutting 0 at a; in dimension d. The start time of these Mondrians equals the birth 
time of the cut that gave rise to 0 < and 0>. Note that the spaces 0 < , 0 > are indeed still boxes 
in R-°, so the recursive calls are valid. 

• Line 7 returns a node of the k-d tree representing the hierarchical partition. The node is a 6-tuple 
of the form (tt,,t c ,d,x,M < ,M > ), where the entries represent respectively the birth time, the cut 
time, the cut dimension, the cut location and the two children of the node. Note that the cut time 
of a node equals the birth time of the cut that splits it. 
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Remark. Several remarks about this generative process are in order: 

• Lines 3-4 can be informally summarized as sampling the cut uniformly from the linear dimension. 

• The distributions Exp(LD(0)) and Discrete(pi,. .. ,pd) (where pd oc {bd — ad)) are well-defined 
provided that the linear dimension of 0 is positive. If we start with a box 0 of positive dimension, 
then with probability 1 the cut location x is sampled in an interior point of [ad, bd] and then both 
0 < and 0 > also have positive linear dimension. 

• We are somewhat sloppy about the generated partition as the cutting hyperplane is included in both 
0 < and 0 > . This informality can be excused since any specific point of interest has probability 0 
of being hit by a cut (the cut location x is generated from a continuous distribution). In particular, 
with probability 1 no cut will appear in the same location where a cut has already been made. 

This generative process translates into the definition of a temporal stochastic process as follows. 

Definition 1.5. Let 0 be a box in R- 0 with positive linear dimension. The Mondrian process on 0, 
denoted as MP(0), is a temporal stochastic process (M t ) t > 0 taking values in guillotine partitions of 0 
and its distribution is specified by the generative process Mondrian(0): the random variable M t is the 
guillotine partition of 0 formed by cuts/nodes with birth time < t. 

In other words, M t is the partition generated by Mondrian(0) with all cuts/nodes born after time 
t ignored. In fact, we can generate the random variable M t precisely by running the recursive process 
Mondrian(0) and terminating any recursive call that generates a cut with time t 0 + T > t. (This is the 
way the Mondrian process has first been introduced in 1|.) 

Definition 1.6. Let 0 be a box in K 15 with positive linear dimension and let t > 0. The Mondrian 
process on 0 with lifetime f, denoted as MP(f, 0), is the law of M t where M ~ MP(0). 

For fixed t > 0, MP(t, 0) is simply a probability distribution over guillotine partitions of 0. Note 
that existing cuts are never removed from a Mondrian process M, so it exhibits the following kind of 
monotonicity property: 

0 < t\ < t -2 => the partition M t2 is a refinement of the partition M tl 

Therefore in the family of probability distributions (MP(f, 0))t>o over guillotine partitions of 0, the 
lifetime parameter t can be thought of as controlling the complexity of the resulting partition. The 
generative process of the Mondrian chooses cut locations uniformly at random, so it is in the way the 
times of the cuts are generated where the ingenuity of the Mondrian process construction lies. The 
resulting elegant mathematical properties, which we explore in subsequent sections, follow from the 
memoryless property of the exponential distribution and the related concept of competing exponential 
clocks (Theorem |0.6[ ). 

Example 1.7. Say we sample from a Mondrian process M on a 2D 
box 0 = [0,1] x [0,1] with a lifetime cut-off at t = 1.5. The figure on 
the right shows the obtained cuts, together with their birth times. 

The process starts at time t = 0 with Mq the trivial partition of 0. 

The first cut appears after time T ~ Exp(A) where A = l + l = 2is 
the linear dimension of [0,1] x [0,1]. At time T the first cut appears 
at a location chosen uniformly from the linear dimension of 0. More 
precisely, first the dimension d of the cut is chosen with probabilities 
proportional to the lengths of 0 in each dimension. In our case 
0 has length 1 in both dimensions, so the cutting dimension d is 
chosen with equal probability ^ from {1,2}. After the dimension 
d is generated, a point x in [0,1] is chosen uniformly at random. 

The cut is then determined by the hyperplane (in our case, a line) 
lying entirely in dimension d and containing the point x on the d-th 
coordinate axis. 

For the sample shown in the figure we generated T « 0.23, d = 1 and x ~ 0.71. For all t £ [0, T), 
M t remains the trivial partition of 0. The cut made at time T partitions the box 0 into two sub-boxes 
©< = [0, x] x [0,1] and 0 > = [ x , 1] x [0,1]. In each of these two sub-boxes the Mondrian process continues 
to run independently and afresh, started at time T. 
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1.1 Mondrian process in ID 

As a first illustration of how the choice of exponential distribution yields elegant properties of the Mon¬ 
drian process, we consider the one-dimensional case, where it turns out that the cut locations follow a 
Poisson point process. The following definition of a Poisson point process is adapted from 3]. 

Definition 1.8. Let A > 0. A random countable subset II of K is a Poisson point process with 
(constant) intensity A, if, for all A £ £>(R), the random variables N(A) := |Iln A\ satisfy: 

(i) N(A) ~ Poisson(Am(A)), where m(A) £ [0, oo] is the Lebesgue measure of A, and 

(ii) if Ai ,..., An are disjoint sets in £>(R) then N(Ai),..., N{A n ) are independent random variables. 
Here B{ R) is the Borel cr-algebra on R. Also, we allow a Poisson distribution with infinite rate, in which 
case N (A) = oo almost surely. 

Suppose we run a Mondrian process M on a one-dimensional axis-aligned box 0 with positive linear 
dimension, which is simply an interval 0 = [a, b] with a < b. Up to a finite lifetime A, the process 
generates a hierarchical partition of [a, 6 ], with each cut having a birth time tb £ [0, A]. Let us now only 
consider the marginal distribution of the cut locations (marginalizing out their hierarchy and times). This 
is a distribution over subsets of [a, b] and in the following lemma we give a simple representation for it. 

Lemma 1.9. Let a < b and A > 0. The distribution of the cut locations {X n } of a Mondrian process AI 
run on [a, b] with a finite lifetime A can be represented by the following two-stage generative process: 

N ~ Poisson(A(& — a)) 

X x ,...,X N \ N z X d -U([a 7 b\) 

In words, the number of cuts is Poisson distributed with rate A (6 — a) and the location of each cut is 
independent and uniformly distributed in the interval. 

Proof. Fix a time instant t £ [0, A] and suppose we are conditionally given the evolution of the process 
M up to time t. Let K — 1 be the number of generated cuts, so that the interval [a, b] is partitioned into 
K segments of the form [xq,xi], [cci,^], ■.., [ Xk-i,xk\ with a = Xq < xi < ■ ■ ■ < Xk = b. The time 
until the next cut appears in a segment [xk,Xk-i] has by memorylessness (Lemma |0.4[ ) Exp(xfc — Xk- i) 
distribution and is independent of all the other segments by construction of the Mondrian process. 

Thus we are in the setting of K competing exponential clocks and the residual time until the next cut 
in [a, b] appears has exponential distribution with rate J2iLi( x k ~ %k-i) = %k — xq = b — a. Also, the 
probability of this cut occurring in a particular segment [xk, Xk-i] is proportional to its length (xk—Xk-i). 
Within the chosen segment the location of the cut is generated uniformly, so marginally the location of 
the next cut in [a, b] is uniformly in [a, b\. 

Thus we’ve shown that at any time instant t , given the past evolution of the process, the residual 
time until the next cut appears is Exp (6 — a) distributed and its location is chosen uniformly from [a, b\. 
As these distributions do not depend on the past evolution, the residual time until the next cut and its 
location are both independent of this past evolution. The times of the cuts form a temporal Poisson 
process with rate b — a, so their number in a time interval of length A is Poisson(A (6 — a)) distributed. □ 




~ Exp (£ 5 — £ 4 ) 


~ Exp(.T 4 - x 3 ) 


~ Exp(a ; 3 - x 2 ) 


~ Exp(.T 2 - £ 1 ) 

— 

~ Exp(xi - £ 0 ) 
- 1 -> 


time 


Figure 1.1: An illustration of the proof of Lemma 1.9 


Here K = 5. 


1.9 
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Theorem 1.10. Let a < b and A > 0. The distribution of the cut locations of a Mondrian process M 
run on [a,b\ with a finite lifetime X is a Poisson point process with constant intensity A. 


Proof. It suffices to show that a Poisson point process with constant intensity A can be generated using 
the two-stage generative process of Lemma |1.9[ The number of points N generated by a Poisson point 
process with constant intensity A on [a, b] has Poisson(A(6 — a)) distribution by definition, matching 
the first stage of the generative process. Conditionally given that the Poisson point process generated 
N = n points, their locations are independent and uniformly distributed, matching the second stage of 
the generative process. (A proof of the last statement is given as Lemma A.7 in the appendix.) As the 
cut locations of a ID Mondrian process and of a Poisson point process can be sampled using the same 
procedure, their distributions must coincide. □ 


1.2 Self-consistency of the Mondrian process 


This section is concerned with the following natural question: if 
we run a Mondrian process on a larger box but only look at what 
happens in a smaller subbox, what distribution of random partitions 
of the subbox do we obtain? More formally, consider the setup 

M ~ MP(0i x ••• x e D ) 

$1 X"-X$flC0 1 X"’X0£i 

i.e., we run a Mondrian process M on an a box 0 := 0! x • • • x 0£> 
and consider some smaller box $ := $! x • • • x $£> contained within it. 

Some cuts of M will cross $ and thus induce a guillotine-partition¬ 
valued stochastic process on <f>. The Mondrian process was conceived 
precisely so that the distribution of this stochastic process is again a 
Mondrian process [4]. Here we give an intuitive argument for where this self-consistency property comes 
from. The choice of the exponential distribution for the times of the cuts turns out to be crucial, as is 
the notion of competing exponential clocks. 

Theorem 1.11 (Self-consistency of Mondrian process). The law of the restriction of M ~ MP(0) to a 
smaller box $ C 0 is again a Mondrian process. 





<J ) 2 




$i 


0i 


We provide intuition for the case D = 2, but the ideas generalize 
to any number of dimensions. To argue that the resulting distribu¬ 
tion on $ is a Mondrian process, we show that the Mondrian process 
M running on 0 generates cuts in $ in the same way as a Mondrian 
process running directly on $ would. 

The first cut in M occurs at time T ~ Exp(LD(0)) and its 
location is uniformly distributed along the linear dimension of 0. 
Employing the notion of competing exponential clocks ’’backwards”, 
we can represent this distribution of time and location of the first 
cut using two competing clocks: 

(1) Clock C\ with rate Ai = LD(<I>). If this clock wins, the loca¬ 
tion of the cut is sampled uniformly from the locations where 


0 2 



( 2 ) 


making a cut splits <f> (green segments in Figure 1.2). 


Figure 1.2: Representing the first 
cut distribution using two compet¬ 
ing exponential clocks. 


Clock C *2 with rate A 2 = LD(0) — LD($). If this clock wins, 
the cut location is sampled uniformly form the locations where 
making a cut doesn’t split $ (red segments in Figure B 
Indeed, under this representation the time until the first cut is ex¬ 
ponentially distributed with the correct rate Ai + A 2 = LD(0) and 

the cut location is sampled uniformly from the linear dimension of 0 since the probability that clock C± 
wins is proportional to Ai = LD($). 
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Note that clock C\ represents the same distribution of the first cut as a Mondrian process running 
directly on $ would. Of course, it may happen that instead clock C 2 wins, and a cut is made outside 
of <f>, as illustrated in Figure [L3| below. But observe that when such a cut is made, the measure of the 
locations where a cut splitting $ can be made (the green segments) remains to be LD(<F). Therefore 
instead of considering two new competing exponential clocks as above, for we can reuse the 


clock Ci that continues to run as an independent exponential clock of rate Ai = LD(<f>) by Theorem 0.6 


02 



© 2 > 


© 2 < 


e> 


H 







e< 


Figure 1.3: Cut outside <f>. Figure 1.4: Cut inside <F. 

Hence, cuts made outside of $ do not affect the distribution of the first cut within $, and this 
distribution is the same as if a Mondrian process was running on $ directly. Now consider the situation 
when finally a cut is made within H>, as illustrated in Figure [T~4) By definition of the Mondrian process, 
the processes on the two sides 0 < , 0 > of this cut continue to run independently, and therefore their 
restrictions to $ are also independent. Thus our argument proceeds by induction, confirming that the 
generative process for the cuts within $ induced by the Mondrian process M run on 0 is the same as of 
a Mondrian process running directly on $. □ 

Example 1.12 (Mondrian slices). An interesting special case of self-consistency is pointed out in [lj. 
Suppose that <f> = $1 X • • • X <f>£> lives in lower dimension than 0, for example $1 = { 2 ;} for some 
x £ 01. As the probability of making a cut precisely at the point x is zero by continuity of the uniform 
distribution, a.s. all cuts of 0 splitting <!> live in the remaining dimensions d ^ 1. Therefore the restriction 
of M ~ MP(0) to *!> can be viewed as a (d— 1) -dimensional Mondrian process run on $2 X • • • x $£>. 

This observation provides some insight into how partitions generated by a Mondrian process look like. 
Along any axis-parallel line, the locations of the cuts crossing it follow the distribution of a ID Mondrian 
process, which has been shown to coincide with a Poisson point process. 

An important corollary of self-consistency is that it provides the Mondrian process with the projec- 
tivity property required for extending its definition form bounded boxes to the entire R D . 

Definition 1.13. The Mondrian process on S. D with lifetime A > 0, written MP(A,M D ), is the tem¬ 
poral stochastic process taking values in (infinite) partitions of K 15 with the property that its restriction 
to any bounded box 0 has the law MP(A, 0), as defined earlier. 
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1.3 Conditional Mondrians 


Conditional Mondrians are a dual notion to consistency. Similarly as before, we have the setup 


$ := x 


M ~ MP(A, 0i x • • • x 0 D ) Ae[0,oo] 
X C 0! X • • ■ X Q D 


but this time we are conditionally given the restriction = m $ of M to the smaller box <f>. (Both the 
locations and times of cuts in <f> are given.) The question we want to approach is, what is the conditional 
distribution M | (M $ = to $ ) and can we sample from it? 

The answer is positive and provides a way of extending an existing sample ~ MP(A, <f>) on <f> to 
a sample M on the larger domain 0 in such a way that the extended sample has the correct marginal 
distribution M ~ MP(A,0). This is because by self-consistency MP(A, <L) can be interpreted both as a 
Mondrian process running on $ or as the restriction to $ of a Mondrian process running on 0. 

Theorem 1.14. Suppose we are conditionally given the restriction = m® of a Mondrian process 
M ~ MP(0) to a smaller box $C0. Let C $ be the first cut in and let t® be its time. Then 

• with probability exp (t® (LD(Q) — L_D($)), the cut C $ is the first cut in 0 (it extends throughout Q) 

• with complementary probability 1 — exp(t $ (LP(0) — LP(4>)) the first cut in 0 misses $, its time 
has the truncated exponential distribution with rate LD(Q) — LD(<&) and truncation at t®, and the 
cut location is uniformly distributed along the segments where making a cut doesn’t hit $. 

Again we only provide an intuition for this result. A calculation 
using the self-consistency property for the case where m $ is the 
trivial partition of <f> can be found as Lemma |A.8| in the appendix. 

Observe that the stated probability of C® being the first cut in 
0 is the likelihood of an exponential clock with rate LD(0) — LD(<f>) 
not to ring at least until time t®. 0 9 

Once again we represent the unconditional distribution of the 
first cut in 0 by two competing exponential clocks Ci, C 2 as in the 
section on self-consistency. Recall that clock C\ has rate Ai = LD(<f>) 
and is associated with the green segments, where making a cut splits 
$. Clock C 2 has rate A 2 = LD(0) — LD(<f>) and is associated with 
the red segments where making a cut misses <f>. Our conditioning 
on = to® tells us that clock C 1 rang at time and the two 
cases in the statement of the theorem correspond respectively to the 
situation where it was the first and where it was the second clock to 



Figure 1.5: Conditional Mondri¬ 
ans. C $ is the first cut in <f>. 


ring. 

If Ci was the second clock to ring, we know from our representation that the location of the first cut 
in 0 is uniformly distributed along the red segments associated with the winning clock C 2 . Also, in that 
case the time of this cut has exponential distribution with the rate A 2 of clock C 2 , but truncated at t® 
since we assume that C 2 rang before t®. □ 

Hence we obtain a simple algorithm for sampling from the conditional distribution M | (M $ = to' 1 ’): 
we sample X 2 ~ Exp(A 2 ), the time when clock C 2 rings. If T 2 < we extend the first cut in $ to the 
whole of 0; otherwise we sample the first cut of 0 uniformly from the locations where it won’t hit 4>. In 
both cases we thus obtain the first cut in 0. Then by definition of the Mondrian process we may proceed 
independently and recursively on the two boxes 0 < , 0 > created by the first cut in 0. (Note that if this 
cut is not then on one of its sides we are no longer conditioning on anything, i.e. an unconditional 
Mondrian will be sampled in that recursive call.) 
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1.4 Remarks 

In this chapter we have defined the Mondrian process and attempted to give intuitive explanations for 
how the choice of exponential distribution and the notion of competing exponential clocks translate into 
some of its elegant properties. A rigorous treatment of the Mondrian process requires additional concepts 
from measure theory and can be found in Dan Roy’s PhD thesis [£. For example, one of the issues we have 
ignored in our exposition is the possibility of the process exploding, i.e. infinitely many cuts occurring in 
a bounded box in finite time. Roy [2j confirms that this happens with probability 0. 

Also, the Mondrian process can be defined slightly more generally. The only property of the uniform 
distribution for sampling cut locations that we have used in our arguments is that it is continuous and 
therefore with probability 1 no two cuts occur at the same location. Thus instead we may take D atomless 
measures /zi,..., Hd on K, define the linear dimension of 0 = 0i x • • ■ x 0£> as LD(0) = A*d(0d) 

and sample cut locations in dimension d from the (possibly unnormalized) measure fid- This preserves the 
self-consistency property and the notion of Conditional Mondrians, while the one dimensional Mondrian 
becomes a Poisson Point Process with (non-constant) intensity function A/zi. Our intuitive arguments 
translate into this more general setting by replacing all interval lengths y—x in dimension d with /id([x, j/]). 



Figure 1.6: Piet Mondrian: Composition with Large Red Plane, Yellow, Black, Gray, and Blue (1921). 
The Mondrian process has been named after the French painter Piet Mondrian due to the resemblance 
of some of his work to partitions generated by a Mondrian process [2]. However, note that unlike the 
depicted painting, partitions generated by a Mondrian process will not have two cuts crossing each other. 






Mondrian forests 


Apart from exhibiting elegant properties, the Mondrian process turns out to be useful in various machine 
learning tasks. In this chapter we give a high-level description of Mondrian forests, a concept introduced 
in 12] for online random forest classification. However, in line with the focus of subsequent chapters, we 
concentrate on regression rather than classification here. The regression problem is defined as follows. 

Definition 2.1. Regression is the problem of learning a function / : R 13 —»• R from a set of training 
data points T> = {(xi, j/i),. .., (xjv, vn)} Q R d x R, where y n is a possibly noisy observation of /(x„). 
Given a new test point x* € R D , the learned function / predicts y = /(x*) for the value of /(x*). 

In particular, we assume the input space to be R 13 . The inputs x are D-dimensional vectors whose 
components are called features or attributes. Each feature can be thought of as a quantifiable property 
of the input, and one hopes that these features provide information useful for estimating the target value. 

A powerful idea exploiting the assumption that nearby points tend to have similar target values is to 
partition the input space R D into connected blocks R D = |_| igJ Bi and to use a simple regression model 
in each block. For example, when asked for a prediction at a test point x* £ R 13 , we might return the 
average target value across those training points (x ra ,?/ n ) that fall into the same block R; as x* does. 

Instead of a single partition, a random forest model obtains M partitions from M independent decision 
trees that hierarchically partition the input space. At test time, each tree provides a prediction and their 
average is returned. Using several trees instead of a single one is a bias reduction technique, useful because 
the partition generated by a single tree is rarely complex enough to match the patterns in training data. 

A Mondrian forest algorithm uses M independent samples from a Mondrian process with finite lifetime 
A to provide the M partitions of R 13 . For Mondrian forest regression, in each cell of each partition we use a 
constant prediction model with a Gaussian prior <Tp rior ) and Gaussian observation noise Af(Q, cdU se )> 
as in Example |0.7| Apart from acting as a regularizes the prior ensures that predictions are well-defined 
in partition cells with no training data. 


Input: training set V = {(xi, yi ),..., (x n , y n )}, lifetime parameter A > 0, test point x* 
Output: estimate y of /(x*) that utilizes information from the training data D 
1 : procedure Train(D, A) 

2 : for m = 1 to M do 

3: T m ~MP(A,R D ) 


5: return (T, p.) 

1: procedure Predict(x*) 

2 : return j-j ^ m=1 > Z m (x*) is the leaf into which x* falls in tree T m 


For each partition cell (leaf) c in T m , compute the 






The algorithm prescribes sampling Mondrian processes on Euclidean space R D , which is strictly 
speaking impossible as they contain infinitely many cuts with probability 1. However, we can invoke 
self-consistency and only sample the Mondrians on a bounded box containing all the datapoints. This is 
sufficient because the Mondrian samples are only used to partition the datapoints. When new training 
points arrive in an online learning setting, the notion of Conditional Mondrians allows us to extend the 
M existing samples to larger regions if necessary. 

In the test phase we also need to incorporate test points x* into the partition. We could again employ 
Conditional Mondrians if the Mondrian samples have not yet been instantiated at the point x*. However, 
[2 points out that it is easy to consider all possible extensions of the partitions analytically and compute 
a prediction by integrating over them. Whenever the test point x* lies outside of the region where a 
Mondrian sample is instantiated, the notion of Conditional Mondrians tells us exactly the probability 
with which x* is separated from the other datapoints by a new cut, in which case the prediction made at 
x* is simply the predictive prior. 

For a more detailed description of Mondrian random forests we refer the interested reader to [ 2 ], where 
the aforementioned procedures are transparently presented. 
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Predictive behaviour far from training data 

When a test data point x* lying far from any training points arrives, the probability that a cut separates 
it from the training data is high and in that case the predictive distribution is simply the prior. So for a 
test point x* far from training data, thanks to integrating over all possible extensions of the M Mondrian 
samples to incorporate x*, the predictive distribution is (close to) the prior. Hence we do not observe 
over confident predictions far from training data, as we do in some other random forest models j5]. 

Classification 

The Mondrian random forest model for classification presented in |2] uses the same model for partitioning 
the input space as outlined above for regression. It only differs in the predictive model used in the leaves, 
which needs to predict classes rather than a continuous target value. A hierarchical Bayesian modeling 
approach is taken to achieve a smoothing effect: the hierarchical partitions provided by the Mondrian 
samples are treated as trees and each node of the tree (not just the leaves) is associated with a predictive 
distribution. Under the prior, the predictive distribution of a non-root node n is modeled as a normalized 
stable process (NSP) with base distribution being the predictive distribution of n’s parent. 

Density estimation 

Density estimation differs from regression and classification in that it is an unsupervised problem, i.e., 
no labels are observed in training data. 

Definition 2.2. Density estimation is the problem of learning a probability density p from a set of 
training samples V = {xi,...,xjv} C generated from p. Given a new test point x*, the learned 
density p estimates p(x*) for the true density at point x*. 

A Mondrian random forest model for density estimation needs to be able to predict density values 
in its leaves, noting that a probability density must integrate to 1. To this end, we associate cells of 
the partitions generated by the Mondrians with probability masses, ensuring that the total mass in one 
partition is 1. However, to arrive at the density, the probability mass associated with a box needs to be 
divided by the volume of that box. This requires us to be able to compute volumes of the partition cells 
generated by the Mondrians, unlike in regression or classification where it was only the partition induced 
on the data points that was relevant. As density estimation is not a main theme of this report, a more 
detailed description of Mondrian random forest density estimation is given in the appendix. 


2.1 Empirical evaluation 

Note that the partitioning of the input space by M Mondrian samples does not take into account labels 
(target values) of the training data points (in the case of regression and classification, where these labels 
are present). It is therefore quite remarkable that an algorithm like this can still achieve as competitive 
predictive performance as shown in 2]. 

We have implemented a Mondrian random forest algorithm both for regression and density estimation. 
More illustrations of empirical results obtained are shown in the chapter on regularization paths, where 
the models are endowed with an additional functionality that allows them to be evaluated much more 
efficiently. 
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Test error rate vs PCA dimension D (classes: [0. 9]) 
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Figure 2.1: We train a Mondrian random forest 
density estimator on the Setosa class of the well- 
known Iris dataset from UCI repository © , using 
several values of the lifetime parameter A. Re¬ 
call that the lifetime controls the complexity of 
the partitions generated by the Mondrian process. 
The horizontal axis of the figure shows this life¬ 
time parameter A of the Mondrian process from 
which M = 200 samples were drawn. The vertical 
axis show the log-likelihood of the learned model 
on both training (red) and test (green) data. The 
plot shows how the lifetime parameter affects the 
complexity of the model: the training likelihood 
increases as the model gets more complex and is 
able to fit the training data better, while the test 
log-likelihood peaks and then starts to decrease 
as the model overfits to the noise present in train¬ 
ing data. Note that for each value of the lifetime 
shown, we have trained a new Mondrian random 
forest density estimation model from scratch. In 
the chapter on computing entire regularization 
paths we will present a much more efficient ap¬ 
proach, where a single model will be evaluated 
for all possible lifetime values in a given range. 


Figure 2.2: A density estimation procedure can 
be used indirectly for classification by estimating 
the probability density of all classes at training 
time and when a new test point x* is presented, 
the density of all classes at x* can be estimated. 
Our prediction may then be the class with high¬ 
est density at x*. In the figure we compare two 
such classifiers: one that uses Mondrian random 
forest density estimation and another that uses 
standard kernel density estimation (KDE) with 
the squared exponential kernel. The dataset con¬ 
sists of images of digits 0 and 9, taken from the 
popular MNIST dataset [7j. The horizontal axis 
shows the number of dimensions into which the 
inputs were projected using linear PCA. The ver¬ 
tical axis shows the error rate of the classifiers 
in distinguishing the digits 0 and 9. It seems 
that the Mondrian approach is more competi¬ 
tive in smaller number of dimensions. We have 
attempted to set the hyperparameters of both 
models to sensible values and kept them constant 
throughout these experiments. 















Laplace Kernel Approximation 


This chapter presents a different way of utilizing Mondrian processes in regression problems, by way 
of approximating the Laplace kernel in a kernel ridge regression setting. We start by reviewing ridge 
regression as an instance of MAP parameter estimation and show how it can be kernelized. 


3.1 Ridge regression 


Say we want to model a dataset V = {(xi, yi),... 


(xtv, Un)} C R- 0 x R using a linear model of the form 


y n =x^O + £ n where 


£l, . . . ,£jV 


i.i.d. 


A(0, 


with cr n oise > 0 fixed and 9 to be learned. Suppose we express our belief that parameters are unlikely to 
take on arbitrarily large values by placing a spherical Gaussian prior 9 ~ Af(0, a 2 rior / d) on 9. It can be 


shown (see Theorem A.9 in the appendix for a proof) that the MAP estimate of 9 under this prior and 
likelihood can be found by minimizing the .^-regularized least squares objective function 


N 


f(9) ■= $ 2 \\9\\i + ^2(y n - x « 6> ) 2 


n =1 


where 6 := ^" oiac > 0. The S 2 factor weights the strength of the regularizer: regularization is strong 
when the ratio of noise and prior variances is large, allowing us to attribute any outliers to noise in the 
observations. Conversely, regularization is weak when the noise variance is small (in comparison to the 
prior variance), forcing the model to better match the training observations. 

The function f{9) is easily minimized using matrix calculus. To this end, we construct the design 
matrix X £ R JVxD whose n-tlr row is the data vector x^, and let y £ R w be a vector with n-tli entry set 
to y n . The function f(9) can then be vectorized as 

m = S 2 \\9\\ 2 + \\y-yL9f 2 

As Theorem |A.10| in the appendix shows, for <5 > 0 this function has a unique global minimum at 

0 MAP = (X T X + (5 2 I D ) _1 X T y 


Kernelizing ridge regression 


This formula requires inve rting the D x D matrix X T X + <5 2 I.d, which involves the feature covariance 
matrix X T X. As Theorem A.11 in the appendix shows, 0 MAP can be alternatively expressed in terms of 
the TV x N matrix XX T + <5 2 Iat, where instead the data covariance matrix XX T appears: 


0 MAP =x T (xx T + (5 2 l7v) -l y 


Occasionally we may have D > N, in which case this second form leads to more efficient inversion of a 
smaller matrix. However, we consider it because it allows the prediction y at a new test point x* to be 
expressed in terms of inner products between data points. More concretely, this prediction is 

V = **0 MAF = (xfX T )(XX T + <5 2 Iiv) -1 y 

Here x^X T £ R w is a row vector with n-th entry the inner product xfx n and XX T £ R JVxAr is the data 
covariance matrix with (*, j)-entry the inner product xfxj. The famous ’’kernel trick” is to observe that in 
a model like this where data locations only enter through inner products x T x', we can replace these inner 
products by a general kernel function fc(x, x'). This kernel function must correspond to inner products 
in some feature space, but this feature space can be arbitrarily complex, even infinite dimensional. The 
trick is that we are able to compute inner products in that feature space efficiently via the kernel function 
fc(x, x'), without the need to map input data to that feature space explicitly. Note that this moves us to 
the world of non-linear regression, since a linear function in the implied feature space usually does not 
correspond to a linear function in input space. 
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A function k is a valid kernel if and only if the resulting Gram matrix is positive semidefinite for any 
collection of datapoints {xi,... ,x„}. This result is known as Mercer’s Theorem. 

When we replace inner products x T x' by the kernel fc(x, x 7 ), the data covariance matrix XX T is 
replaced by the Gram matrix K with (i,j)-e ntry k(x i: Xj) corresponding to the inner product of the 
i-th and j- th training data point in the feature space implicitly represented by the kernel function k. 
Similarly, the row vector xJX. T £ is replaced by k(x*,X) £ R", which is a vector with n-th entry 
equal to fc(x*,x„). Then the prediction y is given by 

y = k(x*, X)(K + <5 2 Ijv) -1 y 


3.2 Kernel approximation 

To compute this prediction y we need access to the inverse of the N x N matrix K + S 2 In■ Even 
though this inversion only needs to be performed once, the 0(1V 3 ) computational cost of inverting such 
a matrix is prohibitive with large datasets. It is not possible to revert to the earlier formulation of the 
ridge regression solution involving a D x D matrix inversion, because that formulation is not in terms 
of inner products between datapoints. Instead, it has been proposed in [ 8 ] to compute a randomized 
low-dimensional feature map z : R 23 —> R c (with C <C N) such that 

fc(x, x 7 ) « z(x) T z(x') 

i.e., such that inner products in the generated low-dimensional feature space approximate the desired 
kernel k. Using z to map each data point x to its low-dimensional feature representation z(x), we can 
then use the first ridge regression solution formulation 0 MAP = (Z 2 Z + <5 2 Ic) -1 Z T y, where Z £ R A7xC ' 
is the feature matrix with n-th row equal to z(x) T . This only requires inverting a C x C matrix, allowing 
this (approximate) solution to be computed in time 0(C 3 C 2 N). As N is expected to dominate C, 

this is essentially 0(C 2 N). 

Many kernels turn out to be effectively approximable this way, with || 8 ] giving two general schemes 
for constructing the random feature mapping z. In this section we show how the Mondrian process can 
be used to approximate one particular kernel, the (symmetric) Laplace kernel. 

Definition 3.1. The (symmetric) Laplace kernel is given by 

/ D 

k(x,x') = exp (—A||x — x 7 ||i) = exp I -A^ \x d - x' d \ 

V d= 1 


where A is a lifetime parameter of the kernel. 

Remark. The Laplace kernel is usually equivalently defined with a length-scale parameter a that is 
related to our lifetime parameter A via A = 1/2 <t 2 , or A = l/2cr or A = 1/cr. Our parametrization and 
naming of the A parameter as lifetime is non-standard, chosen here because of the connection to the 
Mondrian process lifetime that will be revealed next. 

The term ’’symmetric” is used here to point out that this kernel has a single lifetime parameter A 
common to all D dimensions. The last chapter on the Mondrian grid is concerned with approximating 
the general Laplace kernel, where each dimension can have a different lifetime parameter A d- 
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Symmetric Laplace kernel approximation 

To approximate the Laplace kernel for a collection of clatapoints {xi,..., xjv}, suppose we sample a 
Mondrian process on ]R D with a finite lifetime A. As we will only be interested in the partitioning of 
data points induced by the sample, by self-consistency we can simply sample the Mondrian on minimal 
bounded boxes containing the data points, as in Mondrian random forest regression. Let C be the number 
of non-empty partition cells (containing at least one data point) of the sampled Mondrian and label them 
as Let l(x) be the function that returns the cell into which point x £ R. D falls. We define our 

random feature mapping x i—»• z(x) £ M c as 

*(x) :=(I(J(x) = h),..., I(l(x)=lc)) T 


This is simply the indicator vector of the partition cell into which x falls. In particular, it contains a 
single non-zero entry. The inner product between two datapoints in the feature space defined by z is 


z{x) T z{x!) = I (Z(x) = l(x')) 


1 if x, x! are in the same partition cell 
0 otherwise 


Observe that two datapoints x, x! fall into the same cell if and only if the Mondrian sample has no cut in 
the minimal axis-aligned box B({x, x'}) containing x and x'. By self-consistency, the probability of this 
happening is the same as that of running a Mondrian process A4 with the same lifetime A on the box 
B({x,x'j) and not observing any cuts. Thus 


P( 2 (x) T 2 (x') = 1 ) = P (A4' contains no cuts) 

= P(first cut time in M' is > A) |x 2 — x’ 2 \ 

= P (Exp (LD(B({x, x'}))) > A) 

= P(Exp(||x-x'|| 1 ) >A) 

= exp (—A||x — x'Hi) 


x' 

B({x,x'}) 


\xi - a/J 


So inner products in our random feature space are Bernoulli random variables and their expectations 
are precisely the kernel values we want to approximate. To decrease the variance of the approximations, 
instead of using a single Mondrian, we may sample M independent Mondrians and obtain the randomized 
feature mapping z(x) by concatenating feature vectors Zi(x), ..., Zm(* ) from the M Mondrian samples. 
We normalize this vector by M~ 1//2 , so that 


z(x)\ r /z(xQ\ 

y/M J \\/M) 


1 

M 


z(x) T z(x') 


1 M 

— J2 z m{x) T Z m (x!) - > E[ 2 i(x) T 2 ! 1 (x')] =e A||x x|k 


As a Monte Carlo estimate, the convergence to the Laplace kernel as M —► oo is at the standard rate, 
i.e., the standard deviation of the estimator decreases as 
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Empirical evaluation 

We check experimentally that as the number M of Mondrian samples increases, the performance of 
the resulting regression model approaches the performance of a model using the exact Laplace kernel. 
The training data size Wrain = 3000 is chosen so that we see a computational benefit from using our 
approximation but an exact C?(fV t 3 rain ) computation is still possible given enough resources. Recall that 
the asymptotic time complexity of one approximate computation is 0(C 2 N ), where C is the number of 
dimensions of the random feature space produced by z. 

We have also implemented another approximation scheme called Random Binning |[8j to compare 
against our Mondrian approximation. Random Binning has a hyperparameter playing a similar role as 
M that affects the number of random features produced. 


RMSE vs number of random features 
dataset: CPU (N train = 3000, iV tot = 818) 



RMSE vs number of random features 
dataset: census (N train = 3000, JV test = 1000) 



Figure 3.1: Convergence to exact kernel regressor as number C of generated features increases. The 
left plot uses (a random subset of) the CPU dataset, while on the right (a random subset of) the 
Census dataset was used. CPU and Census are the two regression datasets used in [8] where Random 
Binning was introduced. The horizontal axis shows the number of random features C produced by the 
approximations, while the vertical axis shows the RMSE of the resulting regression model on a validation 
data set. The horizontal green line (with dotted standard deviation estimates) is the RMSE obtained 
when using the exact Laplace kernel. We see that the Mondrian approximation and Random Binning 
need to generate a similar number of random features to achieve the same predictive performance, and 
that this performance approaches the performance of the exact classifier as the number of random features 
increases. Random Binning is perhaps slightly more sensitive to randomness, as indicated by occasionally 
much larger standard deviations for both the number of features produced and the validation set RMSE. 

As we shall see in the following chapter, the main advantage of the Mondrian approximation (over, 
say, Random Binning) is that it can be efficiently evaluated for all possible lifetimes A of the approximated 
kernel in a given range A € [0, A]. With Random Binning the approximation needs to be reconstructed 
from scratch for each new lifetime value. 


3.3 Comparison with Mondrian forest regression 

We have presented two non-linear regression models utilizing Mondrians: the Mondrian random forest 
and the Mondrian approximation of the Laplace kernel. In both models we independently sample M 
partitions of the data points at hand, but these partitions are then used differently. In Appendix B 
on Model interpretation we briefly discuss the theoretical similarities and differences between these two 
models. We show that they are both linear smoothers, that they coincide in the case M = 1 and show 
that for M > 1 they can be interpreted as approximating two different quantities. 



























Regularization paths 


The statistical complexity of many machine learning models can be controlled by adjusting their hyper¬ 
parameters. In Mondrian process based models this role is played by the lifetime A of the Mondrian, 
which controls the complexity of the generated partitions. 

Suitable hyperparameter values for modeling the dataset at hand are usually found by cross validation, 
a technique of splitting the dataset into a training set 2? t rain := {(xi, j/i),..., (xAr train , 2/Ar train }) and 
a validation set V va \ := {(xjv train +i, 2/Ar tr ai„+i)> • • ■ > i x N, Un)}, training several models on the former 
and choosing the hyperparameter values giving the best performance on the latter. As this procedure 
contaminates the validation set, we usually preserve an untouched test set on which the performance of 
the model with the finally chosen hyperparameters can be more accurately estimated. In the following 
we implicitly assume that such an independent test set is always preserved. 

Training several models with different hyperparameters is often daunting and computationally expen¬ 
sive, especially if for each new hyperparameter configuration the model needs to be trained from scratch. 
It would be desirable to reuse some parts of the computation with one set of hyperparameters for training 
and evaluating the model with a new set of hyperparameters. The notion of computing entire regular¬ 
ization paths 19] takes this idea to the extreme: it trains and evaluates the model for all possible values 
of a regularization hyperparameter at essentially the cost of training and evaluating a single model. 

In this chapter we outline how this can be done with the lifetime parameter A in Mondrian process 
based models. We focus on regression, but the ideas also apply to classification and density estimation. 

General setup 

All our Mondrian models start by generating M Mondrian samples to provide M partitions of the data 
points. Recall that each cut in each Mondrian is associated with a birth time 0 < < A, where A is 

some terminal lifetime until which the Mondrians are sampled. Let K be the total number of cuts in all 
M samples combined, and let 0 < fq < • • • < tx be an ordered list of their times (the values are distinct 
with probability 1). 

Given the cuts in the Mondrian samples, the model is deterministic. So as time increases from 0 to 
A and new cuts appear in the trees, the model only changes at the K time instants when a cut is added 
to one of the trees. To be able to efficiently compute the entire regularization path over the lifetime, i.e. 
to train and validate the model for all lifetimes A £ [0, A], we need to be able to perform the following 
operation efficiently: 

(01) Given the model trained and evaluated with lifetime tj, compute the model trained 
and evaluated with lifetime ti+ 

Sometimes we will find it easier to traverse the regularization path backwards, which is to say that 
we train and evaluate the model with the maximal lifetime value A and then efficiently compute the 
results for all smaller values of the lifetime in decreasing order. For that, efficient way of performing the 
following operation is required: 

(02) Given the model trained and evaluated with lifetime ti, compute the model trained 
and evaluated with lifetime tj_ 1 

Performing operations (01) or (02) in Mondrian random forest models for classification, regression 
or density estimation turns out to be quite simple, as outlined in the next section. It will be slightly more 
challenging for the Mondrian approximation of the Laplace kernel since the Mondrians do not directly 
make predictions, they only provide a randomized feature mapping. 


4.1 Mondrian random forest 

We traverse the regularization path forwards, starting with lifetime A = 0 and performing operation 
(01) whenever a new cut appears in any of the M trees. Apart from the regression trees themselves, we 
maintain two global quantities: the mean squared error on a validation dataset MSE £ M and the vector 
y £ whose n-th entry is the forest prediction at the n-th data point. We initialize all entries of y to 
the mean of the predictive prior and compute the resulting MSE in time O(N). 

Suppose that at time ti a leaf l in tree m is split into two new child leaves l±, ^ 2 - As predictive 
distributions in individual leaves are independent, the predictions only change for data points in l. The 
posterior predictive distributions in leaves h, I 2 can be computed analytically in time linear in the number 
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of datapoints that end up in these leaves (see Example |0.7| . To see how the model predictions y and the 
MSE are updated, suppose that x„ is a point originally in l that ends up in, say, leaf l\ after the split. If 
yi is the mean of the predictive distribution in l before the split and is the corresponding quantity in 
li after the split, the global prediction of the forest at point x n can be updated as 


tin tin 


M 


M 


since the prediction is simply the average from the M trees. The corresponding update of the MSE is 


MSE' <- MSE - 


{tin - Vn) 2 

N 


{tin - Vn) 2 

N 


Note that these updates take constant time per datapoint in each split, so maintaining the predictive 
distributions, y and MSE only multiplies the running time by a constant. Hence the entire regularization 
path is computed at essentially the same cost as training and evaluating the model at the terminal lifetime 
A. Finally, note that the RMSE can at any time be easily computed as RMSE = VMSE. 


Examples 

Below we show examples of regularization paths for regression (on the left) and for density estimation 
(on the right). The validation set RMSE function is computed using the above described procedure and 
so is the training set RMSE function after simply taking 2? va i = 2? t rain- 


RMSE vs lifetime A (regularization path) 
dataset: CCPP (A trairl =5740, N test = 3828) 



log-likelihood vs lifetime A (regularization path) 
dataset: lris_Setosa (N train = 28, N test = 22) 



Figure 4.1: Mondrian random forest regression 
regularization path. As expected, the red training 
set RMSE decreases as the flexibility (complexity 
of partitions) of the model increases, while the 
green validation set RMSE reaches a minimum 
and then increases as the model starts to overfit 
to the noise in training data. Both curves are 
piecewise constant, with jumps occurring only at 
times when a cut appears in one of the M trees. 
This may not be clearly visible only because of 
the large number of cuts. The regression dataset 


used here is Combined Cycle Power Plant 10 


Figure 4.2: Mondrian random forest density es¬ 
timation regularization path. The regularization 
path for Mondrian random forest density estima¬ 
tion can be computed by essentially the same pro¬ 
cedure as for regression, with the exception that 
instead of predictions and mean squared errors 
(MSE) we maintain the likelihoods of each data 
point. For density estimation it is also possi¬ 
ble to efficiently compute the leave-one-out log- 
likelihood. All log-likelihoods are normalized for 
the number of datapoints. As expected, the val¬ 
idation and leave-one-out likelihoods peak and 
then decrease as the more flexible model starts 
overfitting the training data. The dataset used 
here is the Setosa class form the Iris dataset. [6] 
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4.2 Laplace kernel approximation 

In the Mondrian approximation of the Laplace kernel each datapoint x is encoded as a (normalized) 
concatenation z(x) of M indicator vectors 2 y(x),... ,zjf(x), where z m (x ) indicates which partition cell of 
the m-th Mondrian the point x falls into. When a new cut appears in one of the Mondrians, a partition 
cell is split into two. This corresponds to replacing the feature associated with this cell by two new 
features, one for each child cell. Conversely, when traversing the regularization path backwards and a cut 
is removed from one of the Mondrians, the two features corresponding to the merged cells are replaced 
by their sum. (The sum of indicators of disjoint sets is the indicator of their union.) 

Recall that the random feature representations of the datapoints are organized in the feature matrix 
Z £ M. NxC . Each column of Z corresponds to one feature, i.e. one partition cell in one of the M Mondrian 
samples. Adding new features amounts to appending new columns to this matrix, while summing two 
features corresponds to summing the corresponding columns. Both operations can be carried out easily 
in O(NC) time. The challenge lies in the fact that the predictions of the model are a non-trivial function 
of Z: the prediction y at a test point z* = z(x*) is given by y = 0 T z*, where 0 is the ridge regression 
solution 

0 = (Z T Z + <5 2 I c )- 1 Z T y 

When columns of Z (features) are added, removed or summed we need to efficiently update the inverse 
(Z T Z + S 2 ^)- 1 in order to compute predictions under the new feature mapping. To this end, the 
next subsection reviews a set of general tools for efficiently updating matrix inverses under specific 
perturbations of the matrix that is being inverted. 


Matrix inverse updates 

Lemma 4.1 (Sherman-Morrison-Woodbury inversion formula). For any matrices A £ R" xn , U £ R raxfe , 
C £ R fexfe , V £ R fexn with A and C invertible, if the matrix C -1 + VA _1 U is invertible then 

(A + UCV )" 1 = A ’ 1 - A - 1 U(C -1 + VA _ 1 U) _ 1 VA ^ 1 

Proof. Direct computation yields (A + UCV)(A - 1 - A- 1 U(C - 1 + VA^U^VA" 1 ) = I„. □ 

An important special case of this formula allows us to update a matrix inverse after a rank-one update, 
i.e. after the addition of a rank -1 matrix uv T : 


Corollary 4.2 (Rank-1 matrix inverse update). Let A £ 
be column vectors. If 1 + w 1 A ^ 0 then 


l nxn b e an invertible matrix and let u, v £ 


uv 


, -l 


= A' 1 — A -1 u(l ■ 


- v t A 


i s_i T » -i . _1 A 1 uv t A 
u) v A = A - 


1 + v T A _1 u 


Proof. Apply the Woodbury inversion formula (Lemma|4.l|) with U = u, V = v T and C = 1. 


□ 


If A 1 is known, by bracketing the numerator in Corollary 4.2 as (A 1 u)(v i A 1 ) we can compute 


the inverse of the rank-1 updated matrix A + uv T in time 0(n 2 ), as opposed to the 0(n 3 ) running time 
of matrix inversion from scratch. Note that the following operations are all rank-1 updates: 

• Adding the j-th row r j to the i-th row r;. This amounts to replacing r i with r.; + r j , which can be 
expressed as the addition of uv T with u = e,; and v = rj . 

• Adding the j-th column Cj to the z-tli column cThis amounts to replacing Cj with Cj + c j, which 
can be expressed as the addition of uv T with u = c j and v = e*. 

• Add a constant a £ R to the i-th entry on the main diagonal. This can be expressed as the addition 
of uv ] with u = aei and v = e,. 
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Lemma 4.3 (Inverse of a submatrix). Let A £ R rax " be invertible, let 1 < i < n and let A be the matrix 
obtained from A by deleting its i-th row and i-th column. Let E be the submatrix of A ^ 1 obtained by 
deleting its i-th row and column, let f be the i-th column of A -1 with the i-th entry removed , let g be the 
i-th row of A- 1 with the i-th entry removed, and finally let h be the (i,i) entry of A -1 . 

If h ^ 0 then A is invertible and its inverse is A -1 = E — fg T /h. 

Proof. The proof appears as Lemma |A.14| in the appendix. □ 


This lemma is sufficiently general for us, since the rows and columns of matrices that we will be 
inverting correspond to features in the same order and so we won’t need to remove the i-th row and j-th 
column for i ^ j. Also, note that given A -1 , the update A -1 = E — fg T /h can be computed in 0(n 2 ) 
time as E, f, g and h can be easily extracted from AW 


be invertible. For b £ c £ 


Lemma 4.4 (Inverse of an extended matrix). Let Ael 
d £ M, the extended matrix 

A b 

c T d 

is invertible if and only if its Schur complement s := d — c 1 A _1 b ^ 0, in which case the inverse is 


and 


"A 

b' 

-1 

’E 

f' 

T 

C 

d _ 


T 

|_g 

h 


where 


E = A -1 + s - 1 A -i bc J A 


Ja-i 


g = -s 1 c t A 1 
This inverse can be computed from A -1 in time 0(n 2 ). 


f = -s-'A^b 
h = s -1 


Proof. Appears as Lemma |A.15| in the appendix. 


□ 


Removing a cut 

To demonstrate different concepts, we choose to traverse the regularization path backwards for this model. 
As discussed in the introduction of this chapter, we train and evaluate the model for some terminal lifetime 
value A and then seek to efficiently revert each cut one-by-one (operation (02)), in decreasing order of 
their birth times. 

Suppose we want to revert the effect of cut c with birth time tf, appearing in the m-th Mondrian 
sample. We assume we have access to the feature matrix Z and the inverse CO 1 = (Z 2 Z + <5 2 Ic) -1 
corresponding to features generated by the Mondrians with lifetime tb (i.e., with the cut c present in the 
m-th Mondrian). Let i, j be the indices of the two features introduced by the cut c. Reverting this cut 
amounts to merging these two features together and since they are (rescaled) indicators of disjoint sets, 
this is equivalent to summing the i-th and j-th columns zj, z j of Z together. Thus our goal is to obtain 
the updated inverse 

C- 1 = (Z, T Z + 6%)^ 

where C = C — 1 and Z is the matrix obtained from Z by replacing its i-th and j-th column by their 
sum. (The sum replaces the i-th column, and the j-th column is removed, say.) 

We express the operation of computing C from C as a sequence of four operations in such a way 
that after performing each individual one the resulting matrix is still invertible and the inverse can be 
computed in time (D(C 2 ) using the above introduced tools. 


1 : 

Add row j to row i 

> Rank 1 update, Corollary 

4.2 

2 : 

Add column j to column i 

> Rank 1 update, Corollary 

4.2 

3: 

Delete the j-th row and j-th column 

> Inverse of a submatrix, Lemma 

4.3 

4: 

Subtract S 2 from the (i, i) entry 

> Rank 1 update, Corollary 

4.2 


It is important to carry out steps (3) and (4) in this order, so as to guarantee existence of the inverse 
after each step. The matrix remains invertible after steps (1) and (2) because adding a row to another 
row (or a column to another column) is an elementary operation that preserves the rank of the matrix. 
The inverses after performing steps (3) and (4) are guaranteed to exist because the resulting matrices are 
in both cases positive definite, as can be easily checked. 

Note that we do not in fact require maintaining the matrix C; it suffices to maintain C ” 1 and Z as 
they contain all that is required to perform the updates to C — . 
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Implementation 

We start by computing the Mondrian approximation of the Laplace kernel with the terminal lifetime 
A. This produces M Mondrian trees, each consisting of a hierarchy of cuts with birth times t;, € 
[0, A]. We traverse through these cuts in decreasing order of birth time, at each birth time removing 
the corresponding cut. The previous section describes how the matrix C _1 can be appropriately updated 
in time OiC 2 ). Having access to this updated inverse, predictions on the validation set can be made via 
y = z^9, where 9 = C -1 (Z T y). Using the shown bracketing, the parameter vector 9 can be computed 
in time 0(C 2 + CN train ). The predictions on W va i validation points and the resulting RMSE can then be 
computed in time 0(N val C). Hence the total time complexity of removing a single cut is 0(C 2 + CN). 
As there are C — M cuts to be removed before lifetime 0 is reached, the time complexity of traversing 
the entire regularization path from A down to 0 is 0(C 3 + C 2 N). This is the same as the cost of training 
and evaluating the initial model with lifetime A. 


RMSE vs lifetime A (regularization path) 



Figure 4.3: Regularization path of Laplace kernel approximation using M = 50 Mondrian trees. The 
piecewise constant function shows the validation set RMSE of the resulting regression model as a function 
of the lifetime. In this configuration the regularization path seems to be sensitive to randomness, since 
large jumps occur at times when particularly well placed cuts appear in one of the trees. The dataset 
used is (a random subset of) CPU, one of the regression datasets used in [8]. 

Conclusion 

Initially we have introduced the Mondrian approximation of the symmetric Laplace kernel as a way of 
avoiding the computationally expensive inversion of an N x N (regularized) kernel matrix. While this 
reason still holds, efficient computation of the entire regularization path of this approximation leads to 
another use case: even if a single 0(N 3 ) computation with exact Laplace kernel is feasible, we may 
want to use the Mondrian approximation to efficiently find a value of the lifetime that performs best on 
the validation set. As the lifetime can be seen as controlling model complexity, this yields an efficient 
procedure for determining a suitable model complexity for the dataset at hand. When using the exact 
Laplace kernel, we would probably need to retrain the model from scratch for several values of the lifetime, 
each value requiring a new 0(N 3 ) matrix inversion. 





Mondrian Grid 


We have seen how the Mondrian process is useful for approximating the symmetric Laplace kernel, sharing 
a common lifetime A for all D input dimensions. Moreover, we have seen that the entire regularization 
path over the lifetime A can be efficiently computed, leading to efficient determination of the right model 
complexity for a dataset at hand. In this chapter we seek to achieve the same goal with a more general 
Laplace kernel where different dimensions are allowed to have different lifetimes: 

Definition 5.1. The (general) Laplace kernel is given by 

/ D 

k(x,x) = exp I - ^2 ^d\xd ~ x' d \ 

V d= i 

where Ai,..., Ad are lifetime parameters of the kernel. 

If we were only interested in training a single model with a fixed lifetime configuration A = (Ai,..., Ad), 
we could still use the previous Mondrian approximation after rescaling each input dimension d by A^ and 
then using a symmetric Laplace kernel of lifetime 1. However, we are interested in an efficient procedure 
for the cross validation problem 


argmin error) (Ai, ...,X D ), X> va i) 


The Mondrian process has the limitation that when stopped at a single lifetime A, this lifetime is common 
to all dimensions. Therefore the earlier presented Mondrian approximation of the Laplace kernel is only 
useful for cross validation if ratios of lifetimes in different dimensions can be fixed. In this chapter we 
propose a method that does away with this requirement, allowing us to adjust the approximated lifetime 
in each dimension independently. This model is no longer based on a D-dimensional Mondrian process, 
but rather on D independent one-dimensional Mondrian processes (which have been shown to coincide 
with Poisson point processes in Theorem 1.10). 


Mondrian grid approximator 


A Mondrian grid is a collection of D independent one-dimensional 
where M^ is assumed to run on the d -th coordinate axis of R 33 . 

Suppose we sample a Mondrian grid, which is to say that 
we sample from independent one-dimensional Mondrian processes 
along each coordinate axis, say until a lifetime A d in dimension d. 
The cut locations x['^ < • • • < x ( 'P of M^ provide a partitioning 
of the d-th coordinate axis, which in turn yields a partitioning of 
R 33 by hyperplanes orthogonal to the d-tli coor dinate axis, crossing 
it at the cut locations of (See Figure [HTT] for a 2D illustration, 

where these hyperplanes are dashed lines.) 

The cuts induced by all the D Mondrian samples together parti¬ 
tion R 33 into cells, maximal connected subsets of R 33 not intersect¬ 
ing any cutting hyperplane. Unlike in a D-dimensional Mondrian 
process, these cuts extend all the way through space, uninterrupted 
by cuts in different dimensions. Hence the name Mondrian grid. 

Let C be the number of non-empty grid cells. Similarly as with 
the Mondrian approximation of the Laplace kernel, our random fea¬ 
ture mapping z : R 33 —> R c maps each datapoint x to an indicator 
vector z(x) of the grid cell into which x falls. Dot products in this 
feature space are then 


z ( x ) T z ( y) 


1 if x, y fall into the same grid cell 
0 otherwise 


Mondrian processes ,..., M^ D \ 


d= 2 



Figure 5.1: A sample of a Mondrian 
grid in 2 dimensions. The point 
is a cut location in the sam¬ 
ple from the Mondrian M^ d \ the 
dashed lines show the corresponding 
cuts of R 2 . Points a and b are in the 
same grid cell, whereas the point c 
falls into a different one. Thus here 


N = 3 and C = 2. 
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Using independence of the processes ..., M I ' D) on each axis we have that 

/ D \ 

nz(x) T z(y) = 1) = P n {no cut between Xd and yd in dimension d } I 
\d=l / 

D 

= P( no cut between Xd and yd in dimension d) 
d= 1 
D 

= ]^[ P (Exp (\x d - 2/d I) > 

d= 1 
D 

= exp (-Ad|a?d - 2/rfl) 

d—1 

= exp ^ \ d \x d - 2/d|^ 


[ independence] 

[ self-consistency ] 


As with the previous Mondrian approximation, inner products are a Bernoulli random variables with 
expectations equal to the desired kernel values and by concatenating feature vectors Zi(x ),..., Zm(x) 
from M independent grids into a single feature vector z(x) (and normalizing with Af -1 / 2 ) we obtain the 
Monte Carlo estimator 


s/m) \Vm) 


i 

M 


z(x) T z(y) 


1 

Z rn(*) T Z m (y) -> E [z\ (x) T 2 i(y)] = e~ X d\x d -Vd\ 


whose standard deviation decreases as 0(M x / 2 ) as M —> oo. 


5.1 Regularization paths 

With the Mondrian grid approximation we can adjust the approximated lifetime in dimension d individu¬ 
ally, by changing the lifetime Ad of the one-dimensional Mondrian on the d-th coordinate axis. Moreover, 
it is not necessary to discard the existing grid sample when such a change is made; we only need to 
add or remove cuts in dimension d according to whether A d was increased or decreased. In this section 
we discuss how the predictions of the resulting regression model (using the feature mapping z) can be 
updated when such a change of lifetime in an individual dimension is performed. 

Initialization 

For each dimension d, let N ( [ be the number of distinct d-coordinates of all N data points (training and 
validation combined) and let x^ < • • • < xffl be a sorted list of their values. 

Observe that the Mondrian grid approximation only depends on how the datapoints are partitioned 
into cells by the grid. It does not depend on the number of cuts that separate two points, or on the precise 
location of these cuts. More concretely, the partitioning of the datapoints only depends on whether there 
is or isn’t a cut in the interval for each 2 < i < N^ and each 1 < d < D. So if for each such 

interval (x[ d } 1 ,x[ d ^) we compute the birth time of the first cut t[ d ^ appearing in it, the set of cuts that 
yield a grid approximating the Laplace kernel with lifetimes A = (Ai,..., Ad) is given by taking the cuts 
with birth times t ^ < A d in dimension d. Then by including or removing some of these cuts we will be 
able to easily change the lifetimes of the approximated Laplace kernel. 

By self-consistency of the Mondrian process, the distribution of the time of the first cut in an interval 
(x[ < ^i,x[ d ' > ) is Exp(5- d ^ — We sample this quantity for each such interval M times independently, 

once for each grid. There is no need to sample the exact locations of the cuts, but they would of course 
be uniformly distributed in the interval. 
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d= 2 



l l l l 


Figure 5.2: Example of unique coordinate values x for D = 2 dimensions and N = 6 data points. 
Distributions of the time of first cut in each interval are also shown. 


Before traversing a regularization path along lifetime configurations, we need to pick a starting con¬ 
figuration A 0 = (Aoi,..., Aon). For this lifetime configuration we compute the feature matrix Z € R NxC 
in time 0{NC), where C is the total number of non-empty grid cells in all M grids, where each grid 
consists of those cuts in each dimension d that have birth time % < \od■ We also compute the inverse 
C _1 = (Z T Z + (5 2 Ic) -1 using any standard method in time 0(C 3 ). 


Example 5.2. A natural initialization point might be the lifetime configuration Ao = 0, in which case 
all M grids contain no cuts and so all datapoints fall into the same cell in each grid. Then Z is an 
N x M matrix with all entries equal to 1 j\[M (each entry z nm indicates that the n-th datapoint falls 
into the only grid cell in the m-th grid) and the regularized covariance matrix C = Z t Z + 5 2 I m has all 
non-diagonal entries equal to 


E 


n—1 


l l 

Vm Vm 


N 

M 


and all diagonal entries equal to jj + S 2 . Its inverse C -1 = (Z T Z + 5 2 1 m )~ 1 can be computed in time 
@(M 3 ) using any standard matrix inversion algorithm. The inverse is guaranteed to exist for 6 > 0 as 
the matrix is positive definite. 


Increasing a lifetime 

Say we want to increase the lifetime in dimension d and as a result a new cut is added to the m-th grid 
in an interval for some in and i. When this cut is added to the grid, all cells intersected 

by this cut are split into two. (Note that in the standard Mondrian approximation, a cut only split one 
cell.) However, we will only split those cells where after the split both resulting grid cells will contain 
a datapoint. By carrying out this check we ensure that none of the grids contributes a feature that has 
value 0 for all datapoints. 
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Say we’ve identified that feature z, = (zu,..., z n i) (the *-th column of Z) is one of those features that 
need to be split by the newly added cut. We construct the two sets 


So~- 

jl < n < N \ 

Zni ^ 0} %nd ^ 1J 

Si := ■ 

|l < n < N\ 

Zni '- > 0) %nd ^ ^ 


of indices of datapoints to the left and to the right of the newly added cut, respectively. Note that even 
though the location of the new cut is not determined exactly within x^), this is sufficient because 

no datapoint has d-coordinate lying in this open interval by definition. The feature vectors corresponding 
to the two new grid cells are then 

z' :=-^(I(1€5 0 ),...,I(JV€So)) 


Now we need to remove feature z add the two new features Zq, z( to the matrix Z and update the 


( 2 ) 


inverse C _1 = (Z 1 Z + S 2 Ic)~ L accordingly. We also need to compute new predictions y on the validation 
set and determine the resulting RMSE. 

(1) Deleting the i- th column of Z and appending two new columns Zq, z( to the end can be performed 
easily in time <D{NC). (This allows for reallocating memory for Z if necessary.) 

The i-th row and *-th column of the regularized covariance matrix C correspond to the removed 
feature, so we would delete this row and column from C. Lemma |4.3| on the inverse of a submatrix 
tells us how to update C -1 when the i- th row and column of C are deleted, in time 0{C 2 ). 

The two new features Zq, z\ appended to Z manifest themselves as two new columns and two new 
rows appended to C, where each new entry is a covariance between two features, except for the 
two new diagonal entries which are the variances of the two new features plus the S 2 regularization 
terms. Lemma 4.4 on the inverse of an extended matrix tells us how to update C _1 when a new 
row and column are added to the end of C, in time 0(C 2 ). We apply this procedure twice, first for 
feature z' 0 and then for z(. 

Note that we only need to update C” 1 , there is no need to maintain the matrix C itself. 

(4) Given the updated inverse C -1 , the ridge regression solution 0 MAP i s 


0 


= C ZtiainYtr 


and can be computed in time O(CN) by bracketing the expression as 0 MAP = C" 1 (Z^. ain y tr ai n )- 
(5) Given the updated ridge regression solution 0 MAP ; predictions on the validation set and the resulting 
RMSE can be easily computed in time 0{CN). 

We repeat steps (l)-(5) for each feature that is split into two non-empty features by the newly added cut. 
If K is the number of such features, adding this cut takes 0(K(C 2 + CN)) time. 


Decreasing a lifetime 

Now suppose we want to decrease the lifetime along a dimension d and as a result a cut disappears from 
the m-th grid in an interval (x , x ^), for some m and i. At this stage all pairs of cells that have only 
been separated by this cut need to be merged pairwise together. Thus the problem of decreasing the 
lifetime decomposes into two parts: 

(1) Determining which pairs of features should be merged. 

(2) Merging (summing) those pairs of features and correspondingly updating the matrices Z, C _ , the 
model predictions and the resulting validation set RMSE. 

To solve (1), for each grid cell we maintain a pointer to both its neighbours in each of the D dimensions. 
When a cut in dimension d is removed, for each cell we check whether it is this cut that separates it from 
one of its neighbours in dimension d. If so, these two features are to be merged. 

Once (1) is done, merging a pair of features can be performed simply by removing the two features 
from Z and then appending a feature that equals their sum. We have already seen in the previous 
subsection how Z, C _1 and the predictions can be efficiently updated in time 0(C 2 + CN ) when features 
are added or removed. 
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5.2 Lifetime configuration exploration 

In the previous section we have described how a regularization path can be traversed by efficiently 
increasing or decreasing the lifetime in one dimension individually. However, our ultimate goal is to 
discover a configuration of lifetimes A = (Ai,..., Ad) for the D input dimensions that works well for the 
dataset at hand. As before, we split the dataset into a training set V tra in = {(xi, y \),..., (x 7 v train , 2/Af train }) 
and a validation set 2? V ai = {( x Jv traiI1 +i> 2/JVtram+i)> • • • > (xjv, Vn)} and seek a configuration of lifetimes that 
minimizes the RMSE on the validation set 2? V ai of a model trained on 2?train using this configuration. 

We propose to use the Mondrian grid approximation, where evaluation of the validation set RMSE 
for different lifetime configurations can be performed more efficiently than recomputing it for each con¬ 
figuration individually. This is because having trained the model (computed the matrices Z and C” 1 ) for 
one lifetime configuration, moving to a neighbouring lifetime configuration can be done efficiently using 
the methods described in the previous section. 

To decide which lifetime configuration to explore next based on the history of already explored config¬ 
urations, we need an optimization procedure. The following is a very simple local optimizer that greedily 
increases the lifetime in the dimension that leads to lowest immediate RMSE on the validation set: 


l: Save Z and C -1 

2: for d = 1 to D do 

3: Cd f— first cut in dimension d with birth time strictly larger than Xd 

4: Z (d), -f- updated matrices after adding the cut Cd 

5: Cd <— error on the validation set from Z^), C7^ 

6: d m ove argmin^^ e d 

7:Z^Z W ), C _1 4 — C7. 1 . 

(.amove,)! (amove) 


Validation set RMSE 

M= 1, 5=0.10, data: toy(A fram =21, N test = 15) 



Validation set RMSE vs number of optimization steps 
0.0'n F= 1 -. l5=al0 - data: N ^ = 15 } 


0.035 


0.030 


0.025 


0.020 


0.015 


12345678 
number of optimization steps 


Figure 5.3: Greedy optimization procedure on a toy 2D dataset, where all lifetime configurations can be 
evaluated quickly. Nodes of the surface on the left show the validation set RMSE after a given number 
of cuts has been added in each dimension. The nodes are joined together into a continuous surface 
for clarity, but note that in fact the RMSE surface is piecewise constant with jumps only when a cut 
is added/removed. The green line in the left figure shows the path taken by the greedy optimization 
procedure, started from the origin A = (0,0) (with no cuts added in either dimension) in the top right 
corner. The orange nodes show the lifetime configurations considered by the optimization procedure 
for its next step. On this particular toy dataset the greedy optimizer manages to discover the global 
minimum. The plot on the right plots the height of the green optimization path (validation set RMSE) 
as a function of the number of optimization steps performed. 
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Validation set RMSE 



Figure 5.4: Another toy dataset with two input dimensions, but this time the second dimension d = 2 is 
irrelevant for predicting the target value. The greedy optimization procedure recognizes this and increases 
the lifetime predominantly in the relevant dimension d = 1, as indicated by the green path on the left- 
hand figure. However, due to randomness in the data, occasionally it also increases the lifetime in the 
irrelevant dimension. As this greedy optimization procedure only increases lifetimes, such increase can 
never be reverted in the future even it if would lead to lower validation set RMSE. 

The toy experiment shown in Figure [fo4] suggests that the Mondrian grid approximator could also be 
used for basic feature selection. After the optimization procedure discovers a good lifetime configuration 
A = (Ai,..., A d), a collection of predictive features (input dimensions d) can be obtained by selecting 
those for which A d > £, where e > 0 is some small threshold. 


Validation set RMSE vs number of optimization steps 
data: 3D Road Network_0(iV irai „ = 10000i, N teat =1000L) 



Lifetime configurations path 
data: 3D Road Network_0(iV tra ;„ = 10000L, N test =1000L) 
M=100, <5=0.10 



ifetime A, in dimension d= 1 


Figure 5.5: Greedy optimization of Mondrian grid lifetimes on the real-world 3D Road Network dataset 


11 with D = 2 input dimensions. The left-hand plot shows development of validation set RMSE as the 


optimization progresses and the right-hand plot shows the corresponding evolution of the lifetimes Ai, 
A 2 of the two input dimensions. Note that for the training data size N train = 10000 used here the cubic 
running time of exact computation with the Laplace kernel would be prohibitive, especially if several 
different lifetime configurations were to be tried out. The Mondrian grid approximation with our greedy 
optimization procedure efficiently discovers a good lifetime configuration. (But note there is no guarantee 
that it is globally optimal among all possible lifetime configurations.) 
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Our greedy optimization procedure described above only considers adding cuts (increasing lifetimes). 
However, in the previous section we have also explained how the lifetime in one of the dimensions can be 
efficiently decreased by removing a cut. To see this procedure in action, we have implemented another 
simple greedy optimization procedure that also considers removing a cut in each dimension individually 
before deciding which neighboring configuration to explore next. 


Validation set RMSE 

M= 1, 5 = 0.10, data: toy(7V tram =21, N test = 15) 
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Figure 5.6: Greedy optimization procedure on a toy 2D dataset, where the procedure is also allowed to 
remove a cut (decrease the lifetime) if it leads to lower validation RMSE than any cut addition would. 
A problematic aspect of this greedy optimizer is that it gets easily stuck in local minima. 


5.3 Further work 


There seems to be great room for improvement in the optimization procedure used for deciding which 
lifetime configuration to explore next. We have implemented two greedy optimizers, one that increases 
the lifetime in the dimension leading to lowest validation set RMSE and another that also considers 
decreasing the lifetime. We can easily envision using more sophisticated local optimization algorithms for 
finding minima of the validation set RMSE as a function of lifetime configuration. For example, instead of 
looking one step ahead in each dimension, we could compute K > 1 steps in each direction before deciding 
in which dimension to increase the lifetime. Another interesting method to try could be Simultaneous 
Perturbation Stochastic Approximation (SPSA), which doesn’t require access to the gradient of the 
optimized function and works even in presence of noise in the function measurements. The validation set 
RMSE as a function of the lifetime configuration is not differentiable and the measurements we get are 
noisy due to the noise in training and validation datasets. 

Another approach we may take is to systematically model the validation RMSE as a random (un¬ 
known) function by placing a prior distribution on it (e.g., a Gaussian process), treating the explored 
lifetime configurations as (noisy) observations of this function and computing the posterior distribution 
of the function. This posterior could then be used to guide our decision which lifetime configuration to 
explore next. 

To move between different lifetime configurations we have proposed making efficient updates to the 
e inverse of the regularized feature covariance matrix C = Z r Z + 5 2 Ic- Instead of 
working with matrix inverses directly it is often suggested for numerical stability reasons to work with 


matrix C 1 


the Cholesky decomposition 12 . Even though we have not run into numerical issues in our experiments, 


we outline how the Cholesky decomposition could be used in Appendix D. 
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Selected proofs 


A.l Exponential distribution and exponential clocks 


Proposition A.l. The expectation of the Exp(A) distribution is t. 
Proof. Integrating by parts, 


f xp(x\X)dx= f \xe Xx dx 

Jr Jo 


Xx—e~ Xx 
—A 


r°° i . 

X—e~ Xx dx = 0 + 
i —A 


0 -Ax 


-A 


J o 


1 

A 


Proposition A. 2. Let Z be any real-valued random variable that is a.s. non-negative and possesses 
lack of memory property, i.e., 


□ 

the 


Vs > 0, t > 0 P (Z — t>s\Z>t) = P(Z > s ) 

Then Z ~ Exp(A) for some X > 0. 

Proof. Define G : [0, oo) — > [0,1] to be the tail function G(t) := P (Z > t). Then G is a decreasing 
function with G(0) = 1 and the assumed lack of memory property gives us the functional equation 

G(t + s) = P(Z > t + s) = P(Z - t > s\Z > t)P(Z >t)= P (Z > s)P{Z >t) = G(s)G(t) 


for all s, t > 0. The rest of the proof is concerned with solving this functional equation. 

For n € N we have G{nt) = G(t + (n — 1 )t) = G{t)G{{n — l)t), so by an easy inductive argument we 
obtain G(nt ) = G(t) n . As t is arbitrary here, we can take t = ^ to get G(l) = G(^) n . Noting that G is 
a non-negative function, taking the ?r-th root gives G(^) = G(l) 1 /™ for all natural n. 

Suppose g=^eQ, where m, n € N. Then by the already established results 

G(g) = G = G (m^j = G Q) = G(l) n/m = G(l)« 

Now suppose 2 € (0, oo). By elementary analysis, we can always find a sequence (q n ) of rational numbers 
in (0, z) approximating z from below (i.e. q n f 2 as n —> oo). As G is decreasing and limits preserve weak 
inequalities, 

G(z) < lim G(q n ) = lim G(l) 9 " = G(l) z 

n—> oo n—too 

by continuity of the exponential. Similarly we can consider a sequence of rationals approximating 2 from 
above to deduce the opposite inequality G(z) > G(l) z . Hence G{z) = G(l) 2 for all 2 G (0, 00 ). 

It follows that G(l) > 0 (otherwise we’d contradict the assumption P(Z > 0) = 1) and we can define 
A = — lnG(l) > 0. Then for all 2 > 0 we can write G(z) = e~ MZ and we see that indeed Z ~ Exp (/x). □ 

Lemma A.3 (Lack of memory property). Let Z be an exponential random variable and T an independent 
nonnegative random variable. Then Z has the lack of memory property at the random time T, i.e. 

Vw P(Z-T > u\Z >T) =P{Z > u) 


Proof. For negative u both sides evaluate to 1, so in the following we may assume u > 0. 
Let A be the parameter (inverse mean) of the exponential Z. For any a > 0 we have 


P(Z > R + a) 



e 


— A a 


P(Z > T + a Z = z)fz(z) dz 

[ conditioning on Z ] 

P (T <z-a) f z {z) dz 

[ independence of T and Z \ 

P(T < 2 - a)Xe~ Xz d 2 

[ T is non-negative and a > 0 

P(T < x)Xe~ x{x+a) dx 

[ substitution x = 2 — a] 

/>oo 

/ P(T < x)fz(x)dx 

Jo 
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Using this calculation with a = u and a = 0 we get as required, 


P (Z — T>u\Z>T) = 


P (Z - T > u, Z > T) P (Z -T>u) e 


— Xu 


A u 


P (Z > T) 


P (Z-T> 0) e 


-AO 


= P(Z > u) □ 


Proposition A.4. Suppose X \,..., X n are independent exponential random variables with rates (inverse 
means) Ai,..., X n . Then min Aj ~ Exp(J^ A,). 


Proof. For any t > 0 we have by independence of the AjS that 

P(minX i >t)= P (Q{A z > t}) = JJP(X< > t) = JJ 


e Xit = exp 


(-<£• 


and P(min X, > t) = 1 for t < 0, so indeed the minimum has the claimed distribution. D 

The case of two competing exponential clocks is treated by the following theorem. The statement is 
taken from a problem sheet accompanying my Applied Probability course, the proof is my solution to 
that question. 

Theorem A.5 (Two competing exponential clocks). Let X and Y be independent exponential random 
variables (competing exponential alarm clocks) with respective parameters A and p. Let 


W = min{A, Y}, 


Z = max{A, Y}, O = Z - W, 


M = 1 {X<Y} = 


if X <Y 
if X >Y 


(a) Calculate P (W > s) and P (M = 1). Identify the distributions of W and M. Show that the events 
{W > s} and {M = 1} are independent. 

(b) Express the event {W < w,M = 1, O < t} in terms of X and Y and calculate its probability. What 
is P(W < w,M = 0, O < t) ? Show that W and (M, O) are independent. 


Proof, (a) By Proposition A.4 W ~ Exp(A + p,) and therefore P(W > s) 


Conditioning on the value of X we have 


_ g— (A +u)s. 


poo 

P(A < Y) = / P(A' < Y|A' = x)fx(x) da; [conditioning on the value of A] 
Jo 

P(Y > x)\e~ Xx da; [by independence of A' and Y] 

, — [since P(Y > x) = e~^ x ] 

A + p 

so P (M = 1) = and P (M = 0) = 1 — P(M = 1) = j+JI- other words M ~ Ber(^^). 
To show independence of the events {W > s} and {M = 1}, we just check that 



POO 

P(W > s, M = 1) = / P(min{A, Y} > s,X < Y\X = x)f x (x)dx 

J o 


= / 0dx+ P(Y > x)\e~ Xx dx 

Jo j s 

= (1 _ g-(A+ M ) S \ 

A f/i ) / 

= P(M = 1)P (W > s) 

(b) By definition of our random variables 


[conditioning on X] 

[independence of A, Y] 

[as P(Y >x) = e~ MX ] 
[by above] 


E := P(W <w,M =1,0 <t) = P(min{A, Y} < w, X < Y, max{A, Y} - min{A, Y} < t) 

= P(A < w,X <Y,Y — X < t) 
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Conditioning on the value of X (which has to lie in (0,xe) if the event E is to occur), 

nW 

F(E) = / P(X < Y, Y — X <t\X = x)fx(x) dx [conditioning on the value of X] 

Jo 

f>W 

= / P(x < Y < x + t)Xe~ Xx da; [by independence of X and Y] 

Jo 

= J (e~» x - e -^ x+t) ^j Xe~ Xx da; [as P(F > a) = e~^ a for a > 0] 

Observe that if we repeated the same calculation with the roles of X and Y (and hence of A and /a) 
swapped, we’d be calculating the probability P (W < w,M = 0 ,0 <t) and the result would be the same 
except that A and /x would be swapped). Defining a(0) = /x and a(l) = A, our findings can be expressed 
compactly as 

P (W <w,M = m,0 <t) = (l - e- 0 * 1 "”*)*) (l - e ~ (x+ ^ w ^ 

For any w,t > 0 and m G {0,1} we then get (recalling that W ~ Exp(A + /x)), 

P (W G [0, w], (M, O) G {m} x [0, t[) = P (W < w, M = m,0 <t) 


a ( m ) f —a(l —m)t\ 

l _ e -(A+M)^ 

X + p, \ ) 

- 


= P (W <oo, M = m, O < t)P{W < w ) 

= P((M, O) G {m} x {0, t})P(W G [0, w]) 

As w,t,m were arbitrary, this is sufficient to conclude independence of W and ( M,0 ). 


□ 


A.2 Bayesian Gaussian model 


Proposition A. 6. Under the prior p{p) = A/ r (/x|/x 


priori & prior. 


r ) and likelihood p(y\n) = A/ r (j/|/Lt, cr^ oise ), the 


posterior after collecting N independent observations V = {j/i,..., y n } is 

P (P \V)=Af(p | Pernor + Pnoise En=l Vn ^ + ^ 
\ Pprior i Pnoise 


where p pr ior = g prior an J Pnoise = & noise are prior and noise precisions, respectively. 

Proof. As the posterior distribution is known to be a probability distribution, it suffices to work up to 
proportionality (oc) and normalize at the end: 

N 


p{ia d )= p{u) n^i,) 


oc exp — 


(P Pprior) (Un P) 


2cr 2 • 

prior 


n—1 noise 


N 


OC exp \ | PpriorP ‘^PpriorP Pprior ^PnoiseP ^ ^ Un H - P noise XP 


Pprior “1“ Pnoise-^ ( Pprior Pprior H” Pnoise ^rj = 1 V n 

oc exp ^ ^--- P - - 


Pprior + X Pnois 


\ r ( | PpriorPprior “1“ Pnoise Xm=l 2/™ / , at \ — 1 

OC N j p | -f-- — -, (Pprior + A Pnoise) 


Pprior + N Pnois 


□ 
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A.3 Poisson point process 


Lemma A.7. Let II be a Poisson point process on [a, b] with constant intensity A. Conditionally given 
that the process generated N = n points, their locations are i.i.d. uniform in [a, b}. 


Proof. We follow the proof given in 3 . Let Ai,..., A^ be a partition of [a, b] and let ni,... ,Uk be 
integers with m + • • • + nk = n. Writing N(Ak) := |II D A k \ and m{A k ) for the Lebesgue measure of A k , 
we have by definition of conditional probability 


K 


P| {N(A k ) = n k } | N = n\ = 


P (nf=i {N(A k ) = n fc }) 


\k—l 


F(N = n) 


nf =1 P (N(A k )=n k ) 


P (N = n) 

] K p -\m{A k ) (\m(A k )) n i° 
n k \ 


n K 

k= i e 


a -A m([a,b\) Arra([o.,b])^ 


K 

n 


nu • • • ,n K J V™([a, 6 ]) 


i(A k ) 


(ii) in Definition 1.8 


(i) in Definition 1.8 

K 

^m(A k ) = m{[a,b\) 

k =1 


We recognize this as the multinomial distribution with unnormalized parameters m^Af ),..., m{Ax). This 
distribution can be represented as each point being independently assigned to region A k with probability 
mUc^fcl) » so the n Points are (conditionally) independent. As the partition {A \,..., A k ) was arbitrary, the 
distribution of the point locations is uniform. □ 


A.4 Conditional Mondrians 

Lemma A.8. Suppose we are conditionally given that the restriction M ^ of a Mondrian process M ~ 
MP{ A, 0) with lifetime A € [0, oo) to a smaller box $ C 0 is trivial (contains no cuts). Then 

• with probability exp(A (LD(Q) — LD(' f>)), M is also trivial 

• with complementary probability 1 — exp(A(LD(@) — L_D(<f>)) the first cut in 0 misses 4>, its time has 
the truncated exponential distribution with rate LD{Q) — LD($>) and truncation at A, and the cut 
location is uniformly distributed along the segments where making a cut doesn’t hit 4>. 

Proof. Let T be the time of the first cut of M. By Bayes’ rule, its conditional distribution is 

p(T = t | M* = 0) = =®\ T = t ) 

By definition of the Mondrian process T ~ Exp(LD(0)) and p(M * = 0) = p(MP(A, 4’) = 0) = e -ALD ( <I b 
using self-consistency. Finally, the probability that is empty given that the first cut in 0 occurs at 
time t can be obtained as 

p(M $ = 0 | T = t) = p(first cut misses <E> | T = t)p(M® = 0 | T = t, first cut misses 4>) 

= LD(0) - LD($) (A _ t)LDW 
LD(0) 

where the second equality again uses self-consistency. Plugging into the Bayes’ formula 

(T = t I M* = m = LD(Q) exp(—LD(0)f) LD(0) - LD(4>) (A _ t)LDW 
P{ 1 ’ exp(—ALD(4>)) LD(0) 

= (LD(0) - LD(4>)) exp (-(LD(0) - LD(4>))f) 
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We see that T | (M $ = 0) ~ Exp(LD(@) — LD(<f>)). The probability that this time is within the lifetime 
A of the Mondrian is 1 — exp(A(LD(0) — LD(<1>)) and conditionally on being within the lifetime, the 
distribution becomes truncated at A. 

For brevity of notation, define the event A := {first cut of M occurs outside $ at time T = t}. The 
conditional distribution of the location X of the first cut is, using Bayes’ formula, 

p(X = x\ M * = M) = ^ I X = x,A) 


Given that the first cut occurs outside $, the density of its location is p(X = x\A)~ (LD(0) — LD($)) _1 . 
By self-consistency p(M $ = 0 | X = x, A) does not depend on the value of x and therefore this probability 
equals the denominator p(M rJ> = 0 | A). Hence 


p(X = x | M' J> 


it), A) = p(X = x | A) 


1 

LD(0) — LD(<1>) 


We see that as advertised, the conditional distribution of the cut location is uniform (among cut locations 
that don’t split <I>). □ 


A.5 Ridge regression 


Theorem A. 9. MAP parameter estimation of G in the linear model 

0 ~ Af(0, crl rior l D ) 

y = G 1 x + £ where s ~ A/"(0, a 2 noise ) 


is equivalent to L 2 -regularized least-squares, i.e., to minimizing the function 

N 


m :=<5 2 ||0||2 + E(y--Xn 0 ) 2 


where S := anoise . 

Cf prior 

Proof. The prior on 6 can be written as 
p(G) =J\f(0,al liol I D ) = 


( 27r ) 2 knrior^l 2 


I eX P ( -^ T ( Cr prior I £>) 10 ) = 


{2nal 


exp 


mi 

2<j 2 . 

prior 


By independence of noise in different observations, the likelihood function of the observed data V as a 
function of the parameter 6 factorizes as 


N 


N 


N 


C(G\V) =p{V\G) = Y[p(y n K,0) = []% | xlG,al oise ) = —= 

71=1 71=1 71=1 V * 


1 


: exp 


{yn-y^ef 
2al 


The posterior distribution p(G\T>) is proportional to the product of the prior p{ff) and the likelihood 
C{0\D). The MAP estimate of 6 is obtained by maximizing this posterior, which is equivalent to mini¬ 
mizing its negative likelihood: 

llfllj 2 N ( v _ y Tfl)2 

— \np(6\V) = — In p(0) — \nC(6\D) + const = - 2 2 + ^ ^ 2 ” -t- const 

u prior n=l u noise 


Multiplying this equation by the positive quantity 2cr^ oise > 0 and defining 6 := ^ noise , we can equivalently 
minimize the following function of 6: 

N 

m :=< 5 2 || 0 || l + J2(yn-XnO) 2 

71=1 


As advertised, this is the standard least squares minimization problem with an L 2 regularization term. □ 
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Theorem A.10. For S > 0, the L 2 regularized least squares objective function 

f(G) = S 2 \\0\\l + \\y ~ X6\\l 

has a unique minimum at 9 = (X T X + <5 2 l£>) _1 X T y. 

Proof. Expanding the definition of f(0) we obtain 

f(9) = 6 2 6 t 9 + (y - X0) T (y - X0) 

= 6 2 9 1 9 + y T y — 9 T X T y — y T X9 + 9 T X T X9 [ distributivity ] 

= const + S 2 9 j 9 — 2y T X9 + 9 1 X T X9 [ 9 T X T y and y T X9 are scalars ] 

Having expressed f{9) in terms of matrices, we now recall from matrix calculus that if A is a matrix 
constant with respect to 9 then = A 7 and de d ^ 9 = 2A 7 9. Therefore 

= 2<5 2 0 - 2X T y + 2X T X0 

O6 

Setting these partial derivatives (the gradient of /) to 0 yields the so-called normal equation 

(X T X + S 2 I D )9 = X T y 

Observe that for any v G R D \ {0} we have v T (X T X + (5 2 Id)v = || Xv ||| + 5 2 ||v|| 2 > 0, so the matrix 
(X T X + (5 2 l£>) is positive definite, therefore all its eigenvalues are positive and hence it is invertible. Thus 
we can pre-multiply the normal equation by the inverse of this matrix to obtain an explicit form for the 
location of this stationary point: 

0 = (X T X + <5 2 I D )- 1 X T y 

That this value of 9 is a minimum of / can be confirmed by computing the Hessian matrix of second 
derivatives H = 2<5^I + 2X T X, which is easily seen to be positive definite. Hence / is strictly convex, 
implying that the critical point of / found above must indeed be a unique global minimum. Q 

Theorem A.11. The MAP estimate 0 AIAP of the ridge regression parameter can be also expressed as 

0 MAP = X T (XX T + <5 2 Ijv) -1 y 


Pro of. Fo r the right-hand expression, we start by rewriting the normal 
rem A. 10 as X T X0 -J- = TC t v • rparrn/ncnncr anrl HiviHincr Wv t/hp r»nc 


firm at inn from the 


9 = 5~ 2 (X T y - X T X0) = X t <T 2 (y - X9) 


This still involves 9 on the right-hand side, but we can obtain an expression for X9 from the normal 
equation: pre-multiplying it by X gives 

XX T y = X(X T X + S 2 I d )9 = XX T X9 + S 2 X9 = (XX T + S 2 I N )X9 

Arguing similarly as before, the matrix XX T + <5 2 Ia? is positive definite and therefore invertible for any 
6 > 0, so we may write 

9 = X t 5~ 2 (y - X9) 

= X t 5~ 2 [y - (XX T + 6 2 I N )- 1 XX T y] 

= X t 5~ 2 [y - (XX T + i5 2 Iiv) _1 (XX T + ,5 2 I w )y + S 2 {XX T + 5*1*)-^] 

= X T (XX T + < 5 2 r v )- 1 y 


Note that we’ve used an ’’add and subtract trick” to obtain the third equality. 


□ 
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A.6 Updating matrix inverses 


We start with a lemma showing how the inverse of a matrix A changes when the last row and column of 
A are deleted. A more general case is deduced afterwards. 


Lemma A.12. Let A G 


l nxn be invertible with inverse of the block form 
A-' = 


E f 

? T h 


where E G 


i— l)x (n— 1) 


, f e 


pn—1 


, g e 


pn—1 


and h G R. Let A be the matrix obtained from A by 


deleting its last row and column. If h 7 ^ 0 then A is invertible with inverse A 1 = E — fg 7 /h. 


Proof. Write A = 


A b 

c T d 


. As A A 1 = I n . by properties of matrix multiplication 

AE + bg 7 = I„_i 

Af + bh = 0 

Dividing the second equality by the scalar and solving for b gives b = — Ai/h. Plugging this into 

the first equality yields 

In-! = AE - Afg T //i = A (E - fg T /h) 

This proves that A is invertible and its inverse has the stated form. □ 

Now we deduce a more general result, where it is the i-th row and i-th column of A that are deleted. 

Definition A.13. Let a be a permutation of {1,2,..., n}. The permutation matrix P a G R nx " is the 
monomial matrix with (i,j) entry equal to I(cr(i) = j). 

Observe that for A G R nx ”, pre-multiplication P CT A permutes the rows of A using o~ 1 1 while post¬ 
multiplication AP^ permutes the columns of A using a. Note also that (P <T ) -1 = (P CT ) T = P a ■ 

Lemma A.14 (Inverse of a submatrix). Let A G R nxn be invertible, let 1 < i < n and let A be the 
matrix obtained from A by deleting its i-th row and i-th column. Let E be the submatrix of AT 1 obtained 
by deleting its i-th row and column, let f be the i-th column of A -1 with the i-th entry removed, let g be 
the i-th row of A -1 with the i-th entry removed, and finally let h be the (i,i) entry of A -1 . 

If h ^ 0 then A is invertible and its inverse is A -1 = E — fg 7 jh. 

Proof. Using the cycle notation, define the permutation a = (i i + 1 • • • n). Note that the inverse of this 


permutation a 1 sends i to n. Let 4> : 




(n i)x(n 1 ) b e the function that deletes the last row 


and last co lumn of a matrix, and let if be the function that updates its inverse accordingly (as dictaded 
by Lemma A.12), i.e. = ()>(X) _1 . Note that the matrix A can be equivalently obtained from 

A by sending the i-th row and column to the last positions and then applying the function <f>, so that 
A = ((>(P cr AP 0 ' -1 ). Thus 

A " 1 = ^P-AP -" 1 )- 1 = ^ ((P-AP-- 1 )- 1 ) = (p cr A~ 1 P <T 

The right-hand side tells us that to compute A -1 , we may move the i-th row and column of A to 
the last positions and then apply the procedure from Lemma |A.12[ But that yields precisely that 
A ~ 7 = E — fg T /h when h ^ 0. □ 

The next lemma goes in the reverse direction, showing how the inverse changes when a new row and 
column is appended to the original matrix. 


Lemma A.15 (Inverse of an extended matrix). Let A G K raxn be invertible. For b G R n , c G 
d G R, the extended matrix 

A b 


and 
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is invertible if and, only if its Schur complement s := d — c T A 3 b 7 ^ 0, 

A b' 

c T d 


— _L 

E 

f 


T 

g 

h 


where 


E 

g 


= A' 


+ s 1 A ■'■be 3 A 


—s 1 c 1 A 


in which case the inverse is 

- l f = -s~ 1 A- 1 b 
h = s _1 


This inverse can be computed from A 1 in time 0(n 2 ). 

Proof. First suppose that s / 0. The stated forms for E, f, g, h could be derived from the equations 


A b 


E 

f 


E 

f 


A b 

c T d 


T 

|_g 

h 

= I„+i and 

T 

|_g 

h 


c T d 


but given that we have the forms stated, it suffices to verify that they do solve one of these two equations 
(which then implies that the other is also satisfied). Indeed, for the first equation we have 


A b 

c T d 

I 

c 1 A ' 1 

In 0 

0 T 1 


L A 1 bc 3 A 


— s 1 A 3 b 


_ g .A. ^ s ^ 

, + s _ 1 bc T A _1 — bs~ 1 c 3 A ' 1 
+ c T s - 1 A - 1 bc 3 A -1 — ds _1 c 1 A -1 

In+l 


—s 3 b + bs 


—c s 


T 1 A -1 


A 3 b + ds 


where the definition of the Schur complement s has been used to simplify the two terms in the bottom 
row to get the last row. 

Conversely, if the inverse matrix exists, then Af + hh = 0 and c T f + dh = 1. From the first 
equality we obtain f = — A - 1 b/i and plugging this into the second gives — c T A _ 1 b/i + dh = 1. Thus 
s = — c T A _1 b + d 7 ^ 0, as required to prove the reverse direction of the claim. 

Finally, suppose that A -1 , b, c and d are known. Then s = d — c T (A - 1 b) can be computed in 
time 0{n 2 ) and then h = s -1 is immediate. Also, f = —^(A^ 1 !)), then g = —h(c ri A~ x ) and finally 
E = A -1 — f(c 3 A -1 ) can all be computed in 0(n 2 ) time. □ 



Model interpretation 


In this report we have considered several different models for non-linear regression: Mondrian random 
forest, kernel ridge regression and a Laplace kernel approximation using the Mondrian process. In this 
chapter we show that under certain conditions all three models are so-called linear smoothers and hint 
at what the fundamental difference between these models is. 


Definition B.l. Let V = {(xi,yi),..., (x-NjUn)} be a training dataset. A predictor y for a new test 
point x* is called a linear smoother if it can be expressed as a linear combination of the training 
responses, i.e., 

N 

V = 'y ' k e (x* i x n , X) y n 

n= 1 


The coefficients fc e (x*,x n ,X) are allowed to depend on the locations of all the training datapoints, but 


not on their response values y n 

kernel 


The function k e is called the smoothing matrix or the equivalent 


13 


Proposition B.2. Mondrian random forest regression is a linear smoother if and only if the prior 
predictive distribution in the leaves Af(p pr ior, &p r ior) has y vr ior = 0 or <J pr i or = oo (no prior). 


Proof. Let p pr ior := <7p rior be the prior precision and let p no i se be the observation noise. For 1 < m < M , 
let lm(x) be the function that returns the leaf of the m-th Mondrian tree associated with the partition 
cell into which points x falls. The prediction y at a point x* can be expressed as 



m—1 


PpriorMprior 


Ppr 


"1“ Pnoise X^n=l (** ) — ^m(Xn))2/n 

Pnoise = ^m(Xfc)) 


We see that the constant term not involving y n disappears if and only if /i pr i or = 0 or p pr i 0 r = 0 (no 
prior). In that case the prediction is 


N 


M 




Pnoise^Gm (x* ) 


n—1 


m—1 Pprior H"~ Pnoise — ^m(Xfc)) 


revealing a linear smoother with smoothing matrix 


M 


uMF 

€ M 1 


x*,x, 


- M 5E 






□ 


Remark. The M Mondrian trees of a Mondrian forest are sampled independently, so by the law of large 
numbers, as M — > oo, the obtained smoothing matrix converges at the standard rate to 


k Mt (x*,x„,X)=E 




P pri 


+ Y)k =1 H(^m( x *) = lm(X-k)) 


Ignoring the prior (setting p pr ior = 0), this quantity can be described as the expected proportion of the 
n-th datapoint in the leaf containing the prediction point x*. As we would expect, this is a quantity that 

• is increased when the distance between x* and x n is small 

• is decreased when there is a large number of training point in the vicinity of x* 
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Proposition B.3. Kernel ridge regression is a linear smoother for any valid kernel function k. 

Proof. Expanding the prediction y at a test point x* by writing k(x*,X) for the row vector of length N 
with n-th entry equal to /c(x*,x n ), 

y = x*0 

= k(x*, X)(K + (5 2 Ijv) -1 y 

N N 

= E £ k ( X *’ X *) (( K + i2l ^ _1 ),n 

n—1 i—1 

We see that kernel ridge regression is a linear smoother with smoothing matrix 

N 

k e (x*,x n ,X)=Y / k(x,,x i )((K + 5 2 I JV )- 1 ). n □ 

i—l 

Remark. Interpreting the smoothing matrix of kernel ridge regression is made difficult by the presence 
of the matrix inverse (K + <5 2 Iv) _1 - 

B.l Mondrian forest vs Laplace kernel approximation 

We have presented two models for non-linear regression that utilize Mondrian trees: the Mondrian forest 
regression model and the Mondrian approximation of the Laplace kernel. In both models we independently 
sample M Mondrian trees with finite lifetime A > 0 and use them to obtain M independent partitions of 
the datapoints X at hand. 

However, the two models are also clearly different in some way. With Mondrian forest regression, the 
prediction at a point x* is made by computing the prediction from each of the M trees independently 
and then simply averaging them. There is no interaction between the different Mondrian trees. On the 
other hand, with the Mondrian approximation of the Laplace kernel, we solve a ridge regression problem 
to find a parameter vector 6. There can be non-trivial dependences between all entries of this vector, 
including those corresponding to features produced by different Mondrian trees. Thus in this model, we 
can have interaction between the different Mondrian trees. 

The case M=1 

In this subsection we show that under a slight condition, the two regression models coincide in the case 
M = 1. We will write N(i) for the number of training datapoints falling into the i-th leaf of the single 
Mondrian tree that is sampled. 

Let us consider the Laplace kernel approximation model first. Recall that the ridge regression solution 
can be expressed as 

d = (Z T Z + S 2 I c )- 1 Z T y 

where Z G M. NxC is the training data feature matrix, y are the training data targets and C is the number 
of random features created. The prediction at a new test point z* is given by y = 0 T z*. In the case 
M = 1 we have that 

• Z G M. NxC is a binary matrix, with the n-th row containing precisely one non-zero entry z nc , 
indicating that the n-th datapoint falls into leaf c 

• C = Z T Z + <5 2 Ic G K c ' xC ' has (i, j)-entry Cy equal to, for i j, 

N N 

Cij = (Z T Z).. + <5 2 0 = E z niZnj = E 0 = 0 

n—1 n=l 

and for i = j, 

N N 

Cii = (Z T Z) u + <5 2 1 = ZniZni + 8 2 = Z n i + S~ = N (i) + S 2 

71=1 71=1 

Thus C is a diagonal matrix, with 7-th diagonal entry being the number of datapoints in the 7-tlr 
leaf, plus the S 2 regularization term. 
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• C 1 G R c ' xC ' as an inverse of a diagonal matrix is also diagonal, with i-tli diagonal entry 

1 


c^ = 


N(i) + S 2 


• Z T y G K c has i-th entry a* equal to 


N 

: ^ ' Zin.yn 

n— 1 


This is the sum of target values y n corresponding to datapoints that fall into leaf i. 
• 0 = C -1 (Z T y) G has *-th entry 


- C ii a i ~ + p X] ZinVn 

With (5 = 0 this would be the average target value among datapoints in leaf *; with <5 > 0 this can 
be interpreted as having an additional observation of 0 with weight (5 2 . 

• y = 0 T z* G R. can, by letting c be the leaf of the Mondrian tree into which z* falls, be expressed as 

t 1 N 

y = 6 Z* = ^ 9 C Z* C = 0c= jy,~> Z (5)nyn 

c—C ' n—1 

Hence the prediction y is simply the training average of targets in the leaf of the prediction location, 
regularized by a <5 2 -weighted prior observation of 0. 

This prediction y is the same as the one made by the Mondrian forest regression model with a single 
tree, provided that its hyperparameters are chosen compatibly with 6 2 : the prior predictive distribution 
A/”(0, cTp rior ) must be zero-centered and together with the observation noise A/”(0, cr 2 oise ) they must satisfy 
the relation 

2 

Tioise e2 

2 ^ 

^prior 

This confirms that in the case M = 1, the Laplace kernel approximation approach coincides with the 
Mondrian forest regression model with a zero-centered predictive prior and suitable hyperparameters. 
Since we’ve shown that the former computes a random feature space in which the inner products have 
expected values equal to the Laplace kernel, our model equivalence implies that the same is true of 
Mondrian forest regression (with M = 1), which can therefore also be though of as an approximator for 
the Laplace kernel. 


The general case 

In Mondrian forest regression, we sample M Mondrians in order to decrease the variance of the predic¬ 
tions directly, by averaging the final predictions from M trees. This increases the probability that the 
predictions will be closer to their expectation. 

In Laplace kernel approximation using Mondrian trees, we sample the M Mondrians in order to 
decrease the variance of the kernel approximation. This increases the probability that inner products in 
the random feature space will better approximate the Laplace kernel. 

Another way of thinking about the similarities and differences between these two models can be gained 
by interpreting them as linear smoothers. 

Linear smoothers 

We’ve already seen that Mondrian forest regression is a linear smoother for any value of M, provided 
that the predictive prior has mean zero. We’ve also seen that the kernel ridge regression model (with any 
valid kernel) is also a linear smoother. Very similarly, it turns out that the Mondrian approximation of 
the Laplace kernel is also a linear smoother, for any value of M. 
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Proposition B.4. The Laplace kernel approximation using M Mondrian trees is a linear smoother. 
Proof. Expanding the formula for the prediction y at a new test point z*, 

y = ^o 

= zf (Z T Z + (5 2 Ic) -1 Z T y 

N C C 

= £ Vn £ £(Z T Z + PictizL 

n= 1 c=l k—1 

N C C 

= £^£-c£(z-z^ 2 i c) c - fe ^ □ 

n=l c=l /c—1 

We can read off the smoothing matrix k e ( z*,z n ,Z) = z *c Efc= i(Z T Z + 5 2 Ic) cfc 1 ^nfe, but the 

presence of the matrix inverse makes if difficult to analyse it in the general case (we’ve seen that for 
M = 1 the inverted matrix is diagonal). 

Now we can summarize the difference between Mondrian forest regression and Laplace kernel ap¬ 
proximation as follows. In Laplace kernel approximation we use M independent Mondrian samples to 
approximate the Laplace kernel. As the prediction is not a linear function of the kernel (it involves a ma¬ 
trix inverse), the prediction doesn’t decompose with the M trees and therefore there can be a non-trivial 
interaction between the different trees. On the other hand, Mondrian forest regression uses M inde¬ 
pendent Mondrian samples to approximate the equivalent kernel fc MF (x*,x„,X) (the smoothing matrix). 
The prediction y is by definition a linear function of the equivalent kernel: 

N 

V = £ fc MF (x*,x„,X)y„ 

n= 1 

Hence for Mondrian forest regression, the prediction necessarily decomposes with the M Mondrian sam¬ 
ples (there is no interaction between the different trees). 



Density estimation 


Definition C.l. The Beta distribution with shape parameters a,b > 0 has probability density function 

p(x\a,b) = ~ x )^ 1 0 < a; < 1 

r(a)r(6) 

where T(f) = J 0 °° z t ~ l e~ z dz is the gamma function (a shifted generalization of the factorial function). 
Proposition C.2. The expectation of the Beta(a,b) distribution is 
Proof. The expectation can be calculated from definition: 


E[Beta(a, 6 )] = 


- x r'= 


! 0 T(a)r(6)' 


^T(a+l + b) x a {l _ x)b _ 1= a 


a + bj o T(a + 1 )T( 6 ) 


a + b 


by recognizing the integral over the Beta(a + 1,5) distribution which must evaluate to 1. 

Density estimation differs from regression and classification in that it is an unsupervised problem, 
no labels are observed in training data. 


□ 

i.e., 


Definition C.3. Density estimation is the problem of learning a probability density p from a set of 
training samples V = {xi,... ,xjv} C M 15 generated from p. Given a new test point x*, the learned 
density p estimates p(x*) for the true density at point x*. 

A Mondrian random forest model for density estimation needs to be able to predict density values 
in its leaves, noting that a probability density needs to integrate to 1. Following the approach taken for 
classification, we use a hierarchical Bayesian model. In each sample of the Mondrian process, we associate 
each node (not just the leaves) with an unknown probability mass, with the constraint that each non-leaf 
node’s mass must equal the sum of masses associated with its two children. The prior distribution over 
these masses is as follows: 

• With probability 1, the root (of depth 0) is associated with a probability mass of 1. 

• Say n is a non-leaf node of depth d, with associated probability mass P n . Let c\, C 2 be the children 
of n and let Vj, V -2 be the volumes of the boxes in associated with ci and C 2 - Under the prior, 
the probability masses P n i, P n 2 associated with the children are then assumed to be generated as 
follows: 


Beta ( 7 (d + l) 2 ,y (d + l ) 2 


Ui + U 2 


Vi +u 2 


Pnl — ~n Pn 

Pn2 = (1 - £n)Pn 


Here 7 > 0 is a hyperparameter of the prior to be chosen. Note that the two parameters of the Beta 
distribution are proportional to the volumes V \, V 2 , so that the child of larger volume is more likely to 
get a larger share of n’s associated probability mass P n . Also, the sum of the two Beta parameters is 
designed to be 7 (d+ l) 2 , in line with the Polya tree prior distribution presented in 14 where this choice 
leads to an a.s. absolutely continuous function in the limit d -+ 00 . 

As training data is added to the M Mondrian samples, the posterior distributions of e n can be 
computed analytically thanks to Beta-Binomial conjugacy. Under the posterior distribution, a probability 
mass P n of node n is a product of independent Beta distributions, which doesn’t have a simple analytic 
form, but its expectation can be easily computed. If n is in fact a leaf, we assume the mass P n to 
be uniformly distributed in the box associated with n. Note that this again requires knowledge of this 
volume, if a density is to be predicted at point lying in n. 

In regression and classification, the Mondrian samples are used only to partition the datapoints. In 
our density estimation model we also require knowing the volumes of the partition cells generated by the 
Mondrians. This proves to be a limiting factor for the computational performance of the model, since 
self-consistency cannot be invoked as easily as in regression and classification: we need to know how 
far individual partition cells extend, which might be far from any training data. We have explored two 
possible solutions to this problem: 
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(1) Identify a bounded box around the training data and assume that all probability mass lies in 
that box. Within this box we instantiate the Mondrian samples completely, i.e. even in regions 
containing no data points, since some of these cuts may still affect the volume of a box that is 
non-empty. 

(2) Rather than having an exact value for the volume of each box, we may compute its probability 
distribution under the randomness stemming from not instantiating the Mondrian samples in regions 
with no training data. Unfortunately, this distribution is somewhat more complicated than hoped: 
we believe it to be the product of independent, truncated piecewise exponential distributions. 



Cholesky decomposition 


In the chapters on approximating the Laplace kernel we’ve proposed making efficient updates to the 
matrix C _1 , the inverse of the regularized covariance matrix C = Z T Z + S 2 Ic, to efficiently explore 
the space of possible Laplace kernel lifetimes. Instead of working with this inverse directly, it is often 
suggested for numerical stability reasons to work with the Cholesky decomposition 
we sketch how this can be achieved in our setting. 


12 . In this section 


Definition D.l. Given a positive definite matrix C (i.e., v T Cv > 0 for all v 7 ^ 0), its Cholesky 
decomposition is a factorization of the form C = LL 2 , where L is a lower-triangular matrix. 


It is a standard linear algebra result that the Cholesky decomposition exists and is unique for positive 
definite matrices. 

Suppose first that the Cholesky decomposition C = LL 2 of the regularized covariance matrix C = 
Z T Z+S 2 Ic is known. The normal equation of ridge regression then reads LL 2 0 MAP = Z T y, which can be 
solved in time 0(C 2 ) by performing two back-substitution passes. Indeed, we may define 7 := L T 6 > MAP , 
solve the equation L 7 = Z T y using back-substitution (L is lower-triangular) and then solve the defining 
relation of 7 for 0 MAP using another back-substitution (L T is upper-triangular). 

Thus if we can maintain the Cholesky decomposition of the regularized covariance matrix, we will be 
able to efficiently compute the optimal ridge regression solution and hence the error on the validation set. 
Now it remains to find a way of efficiently updating the Cholesky decomposition when the regularized 
covariance matrix changes due to a change in lifetimes of the Laplace kernel being approximated. 

First consider how the Cholesky decomposition can be efficiently updated when the matrix C is 
extended, which happens when new features are created by adding a new cut into the Mondrian grid. As 
the ordering of features doesn’t matter, we may assume that new features are appended as last columns 
to the feature matrix Z. This means that we only need to consider extending C by a new last row and 
column. The equation 


C a 


E 0 


E t c 

a 2 b 


c T d 


0 d 


is equivalent to the system C = EE 2 , a = Ec and c 2 c + d 2 = b. Provided that we’ve maintained the 
Cholesky decomposition of C, the first equation is solved by taking E = L. Then as E is lower-triangular, 
we can solve for c in the second equation using back-substitution, and finally obtain d = \Jb — c T c from 
the third equation, all in time 0(C 2 ). 

It is significantly more challenging to update the Cholesky decomposition when a feature is removed, 
which happens when a feature that has been split into two by a new cut needs to be removed. The 
reason for the difficulty is that the removed feature need not correspond to the last row and column of 
C, so updating the Cholesky decomposition is not simply a matter of deleting the corresponding row and 
column from L. Instead, a two step procedure can be carried out 12 : 

(1) Rotate the rows and columns of C so that those intended for deletion end up as the last row 
and column. Update the Cholesky decomposition correspondingly, e.g. using the SCHEX 15 
subroutine from the LINPACK Fortran linear algebra package. 


( 2 ) 


Delete the last row and column of C, to which the corresponding update of L is simply to remove 
its last row and column, as can be easily checked. 

Both steps run in 0(C 2 ) time. This shows that the maintenance of the Cholesky decomposition C = LL T 
can be performed with the same efficiency as the maintenance of the inverse C _ , so indeed we can work 
with the Cholesky decomposition if so desired. 

On the datasets considered we haven’t run into numerical stability issues while working with the matrix 
inverse CU 1 directly, so the approach using the Cholesky decomposition has not been implemented. Hence 
we also omit a technical description of the SCHEX subroutine here. 
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